I Thought I Understood Containers. Then I Tried Building One.

I had just aced my mentor’s Docker exam, so of course I thought I understood containers. I had said all the right words: namespaces, cgroups, images, layers, PID 1, Kubernetes Pods. Then I typed my first serious command and Linux reminded me that knowing the nouns is not the same thing as building the thing.

$ sudo unshare -p 1 test
unshare: failed to execute 1: No such file or directory

That was the opening scene. I had not even built anything yet. I had typed the flags wrong and accidentally asked unshare to execute a program called 1. This was going to be less “implement Docker” and more “let the kernel correct my confidence, one error at a time.”

v1: namespaces, or the first time PID 1 lied to me

The first version was supposed to be easy: run a process in a new PID namespace and prove it sees itself as PID 1. So I ran the command the way I thought it worked:

$ sudo unshare --pid bash

echo $$

25184

That was not PID 1. That was just embarrassing. The rule I had missed is simple: PID namespaces apply to children. The process that calls unshare --pid does not magically become PID 1. You need to fork. The first child born into the new namespace becomes PID 1. So the working version was:

$ sudo unshare --pid --fork bash

echo $$

1

That one line changed the tone. I was inside a different process universe. The shell thought it was process 1. Signals felt different. Orphans came home to it. Then I ran ps, and got humbled again.

ps -o pid,ppid,comm

  PID  PPID COMMAND
25310 25304 bash
25344 25310 ps

That made no sense at first. I was PID 1, but ps was showing host-looking PIDs. The next reveal: ps does not ask the kernel some pure “what processes exist?” question. It reads files. If /proc still points at the host procfs, your tools will tell you the host story. So I remounted /proc from inside the namespace:

mount -t proc proc /proc

ps -o pid,ppid,comm

  PID  PPID COMMAND
    1     0 bash
    7     1 ps

That was when it clicked. The namespace did not become real to my eyes until /proc agreed with it. Before that, I had isolation, but my tools were reading the old filesystem view.
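The point about ps reading files can be checked by hand. What follows is a minimal sketch, assuming a Linux procfs where each numeric directory holds a comm file. The name list_procs is a hypothetical helper, and real ps reads more than comm (stat and status, for example), so treat this as an approximation of the idea rather than a replacement for ps.

```shell
# list_procs [PROCFS_DIR]: print "PID COMMAND" for every process visible in a
# procfs mount (defaults to /proc). Hypothetical helper for illustration; it
# only looks at <pid>/comm.
list_procs() {
  local root="${1:-/proc}" dir
  for dir in "$root"/[0-9]*; do
    # Skip an unmatched glob and processes that exited mid-listing.
    [ -r "$dir/comm" ] || continue
    printf '%s %s\n' "${dir##*/}" "$(cat "$dir/comm")"
  done
}
```

Calling list_procs with no argument reads whatever is mounted at /proc, so inside the new PID namespace it should keep showing host-looking PIDs until /proc is remounted, which is the same behavior ps showed.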

The UTS namespace lesson was cleaner. I accidentally ran a science experiment. In one terminal, without a UTS namespace:

$ hostname
ba149abae9bd

Then inside a new UTS namespace:

$ sudo unshare --uts bash

hostname toybox

hostname

toybox

Back outside that UTS namespace:

$ hostname
ba149abae9bd

That was my control and treatment. Same machine, same kernel, different hostname view. The “host” was already a container hostname, which made the containers-inside-containers setup visible in the output. Nothing mystical. Just one isolated kernel data structure doing exactly what the docs said, except now I had seen it with my own eyes.

v2: pivot_root, the boss fight

After namespaces, I got overconfident again. The next version was supposed to give the process its own filesystem: a tiny rootfs, a shell, maybe BusyBox. Very container-ish. My repo had bash scripts for this, not some compiled runtime from a tutorial. So the shape of the attempt was v2.sh, a rootfs, and a command to run inside it. The parade started with the obvious error:

$ sudo ./v2.sh ./rootfs /bin/sh
exec /bin/sh: no such file or directory

Fine. There was no shell where I said there would be a shell. I fixed that with Alpine’s own BusyBox and hit the more annoying version: the file existed, but the kernel still said it could not run it.

$ ./rootfs/bin/busybox sh
bash: ./rootfs/bin/busybox: cannot execute: required file not found

This is the kind of error that feels personal because you can list the file. You can see the symlink. The computer still refuses. The plot twist came from file:

$ cd rootfs
$ file bin/busybox
bin/busybox: ELF 64-bit LSB pie executable, ARM aarch64, dynamically linked, interpreter /lib/ld-musl-aarch64.so.1, stripped

The binary was there. The interpreter was not available from the old world: the path /lib/ld-musl-aarch64.so.1 is resolved against the current root, and the Ubuntu side had no musl loader there. Linux was not saying “your BusyBox file does not exist.” It was saying “from here, I cannot load the interpreter this ELF needs.” Same surface error, different problem.
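The same diagnosis can be scripted. This is a minimal sketch that parses the text printed by file, in the format shown in the transcript. The helper names interp_from_file, interp_exists_in, and check_interp are hypothetical, and the existence check is naive: an absolute symlink inside the rootfs would be followed against the current root, not the future one.

```shell
# Hypothetical helpers for illustration; they rely on the wording of file(1).

# interp_from_file: read `file` output on stdin, print the ELF interpreter path
# (prints nothing for static binaries or non-ELF files).
interp_from_file() {
  sed -n 's/.*, interpreter \([^,]*\).*/\1/p'
}

# interp_exists_in ROOT PATH: succeed if PATH exists when looked up under ROOT.
interp_exists_in() {
  [ -e "${1%/}$2" ]
}

# check_interp ROOT BINARY: report whether BINARY's loader is present under ROOT.
check_interp() {
  local interp
  interp=$(file "$2" | interp_from_file)
  if [ -z "$interp" ]; then
    echo "$2: no interpreter reported (static, or not ELF)"
  elif interp_exists_in "$1" "$interp"; then
    echo "$2: $interp found under $1"
  else
    echo "$2: $interp missing under $1"
  fi
}
```

From the Ubuntu side, check_interp / ./rootfs/bin/busybox asks the question the kernel was asking, while check_interp ./rootfs ./rootfs/bin/busybox asks the question that matters after the root switch.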

The fix was not what I first thought. Alpine’s BusyBox did not need to become static. Once Alpine became /, its musl loader would be at /lib/ld-musl-aarch64.so.1, and Alpine’s /bin/sh would be happy. The thing that needed help was the handoff itself: my Ubuntu slim image did not even have pivot_root.

$ pivot_root . put_old
bash: pivot_root: command not found
$ file /bin/busybox
/bin/busybox: ELF 64-bit LSB executable, ARM aarch64, statically linked, stripped
$ /bin/busybox pivot_root . put_old

That was the better plot twist: the old world could not perform the handoff without borrowing a static tool. busybox-static was not my replacement shell inside Alpine. It was the bridge that could run before and during the transition.
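The handoff can be sketched as a short sequence. This is not the v2.sh from the repo, which is not shown here. It is a minimal sketch assuming the script already runs as root inside a fresh mount namespace with private propagation, and that a static BusyBox lives at /bin/busybox as in the transcript. The DRY_RUN variable and the run and handoff helpers are hypothetical names added for illustration. pivot_root needs the new root to be a mount point, hence the bind mount onto itself.

```shell
# run: execute a command, or only print it when DRY_RUN is set
# (hypothetical helper so the sequence can be inspected safely).
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# handoff ROOTFS: make ROOTFS the new /, parking the old root at /put_old.
# Assumes root privileges and a private mount namespace (not set up here).
handoff() {
  run mount --bind "$1" "$1"             # pivot_root requires a mount point
  run cd "$1"
  run mkdir -p put_old
  run /bin/busybox pivot_root . put_old  # static binary: no loader needed
  run cd /                               # step into the new root
}
```

Running DRY_RUN=1 handoff ./rootfs prints the five commands without touching any mounts, which is a reasonable way to review the order before trying it for real.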

Then I hit the Bash hash-cache moment. Alpine was now /, but Bash still remembered a command path from before the filesystem switch. It went hunting for /usr/bin/mount in a world that had just been evicted.

/ # mount -t proc proc /proc
bash: /usr/bin/mount: No such file or directory
/ # hash -r
/ # mount -t proc proc /proc

I had fixed the filesystem and was still debugging an old decision Bash had remembered for me. That is the kind of bug that makes you take a short walk.
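The stale lookup is reproducible without any namespaces. Here is a minimal sketch using two throwaway directories. The names greet and demo_stale_hash are made up for the demo, and the middle call is allowed to fail because whether the shell retries the PATH search depends on the shell and its options (Bash has a checkhash option for this).

```shell
# demo_stale_hash: show a shell remembering a command path that has gone away.
# greet and the two temp directories are made up for this demo.
demo_stale_hash() {
  local old new
  old=$(mktemp -d)
  new=$(mktemp -d)
  printf '#!/bin/sh\necho old\n' > "$old/greet"
  printf '#!/bin/sh\necho new\n' > "$new/greet"
  chmod +x "$old/greet" "$new/greet"
  (
    PATH="$old:$new:$PATH"
    hash -r                                # start with an empty cache
    greet                                  # found in $old, path remembered
    rm "$old/greet"                        # the old location disappears
    greet || echo "stale lookup failed"    # may still try the remembered path
    hash -r                                # forget remembered paths
    greet                                  # fresh PATH search finds $new
  )
  rm -rf "$old" "$new"
}
```

The first call prints old and the last prints new. What happens in between is the interesting part: a shell that trusts its cache reports a missing file even though a perfectly good greet is sitting later in PATH.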

Then came the Mac problem. My setup was not “normal Linux laptop, local ext4 disk.” It was Apple Silicon Mac → privileged Ubuntu container → repo mounted from macOS. That means virtiofs was in the story whether I wanted it there or not. The symptom showed up after the pivot, inside Alpine, which made it stranger. Applet symlinks like mount and ls could fail with Permission denied on the Mac-shared mount, while calling BusyBox directly still worked. The files were there; executing through those symlinks was the weird part.

/ # ls
sh: ls: Permission denied
/ # /bin/busybox ls
bin dev etc lib proc put_old
/ # mount -t proc proc /proc
sh: mount: Permission denied
/ # /bin/busybox mount -t proc proc /proc
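Why the symlink and the direct call are different code paths can be shown with a toy multi-call program. This is a stand-in written for illustration, not BusyBox itself, and multicall and make_multicall are hypothetical names. The mechanism it imitates is that a multi-call binary picks its applet from the name it was invoked as, so running ls means executing through the symlink, while /bin/busybox ls executes the real file and passes the applet as an argument. It does not reproduce the Permission denied from the shared mount.

```shell
# make_multicall DIR: write a toy multi-call script into DIR plus an "ls"
# applet symlink. A stand-in for illustration only, not BusyBox.
make_multicall() {
  cat > "$1/multicall" <<'EOF'
#!/bin/sh
applet=${0##*/}              # name we were invoked as
if [ "$applet" = multicall ]; then
  applet=$1                  # direct call: applet is the first argument
fi
echo "applet: $applet"
EOF
  chmod +x "$1/multicall"
  ln -s multicall "$1/ls"    # like /bin/ls -> /bin/busybox
}

dir=$(mktemp -d)
make_multicall "$dir"
"$dir/ls"                    # executes through the symlink
"$dir/multicall" ls          # executes the real file, names the applet
rm -rf "$dir"
```

Both invocations print applet: ls, but only the first one depends on the filesystem being willing to execute through a symlink, which is the step that failed on the Mac-shared mount.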