Making devenv start fast, and the whole nixpkgs with it - devenv
I’m sitting here next to Farid Zakaria at Tacosprint, where we looked at the stat storm that has been haunting nixpkgs for a decade. devenv auto activation runs devenv hook-should-activate on every shell prompt to decide whether you’ve stepped into a project directory. It does almost nothing: discover the project, check the trust database, print a path. So its runtime is pure startup overhead, and it runs on every single prompt redraw.
$ time devenv hook-should-activate /home/domen/dev/myproject
real 0m0.070s
… 70ms before a prompt, every prompt. And this isn’t devenv’s tax to pay, it’s nixpkgs’. Every program pays it before it runs a line of its own code: the dynamic loader has to find each shared library, and the way Nix scatters packages across the store makes that search slow. This is not news. The cost has been measured, written up, and partly fixed more than once, and yet it has sat in limbo for the better part of a decade with no general fix merged into nixpkgs.
Most of that is the dynamic loader looking for a shared object that is sitting right there in the store, just not in the first directory it tried. The loader knocks on 486 wrong doors before it finds the right ones, and almost all of it happens before main even starts. That 70ms is the whole game. Above ~30ms you have to bolt a caching layer on top of the hook; in single-digit milliseconds you just run it on every prompt and throw the cache away. And the cost scales with the closure: imagemagick’s magick --version makes 1225 failing opens:
$ strace -f -e openat magick --version 2>&1 >/dev/null | grep '\.so' | grep -c ENOENT
1225
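To see which directories those failures land in, the same strace output can be grouped by directory. The following is a minimal sketch, not part of devenv: the `summarize` helper and its regular expression are illustrative, and it assumes strace's usual one-line-per-call output (calls split into "unfinished" and "resumed" halves by `-f` are simply skipped).

```python
import os
import re
from collections import Counter

# Matches strace lines such as:
#   openat(AT_FDCWD, "/nix/store/x/lib/libz.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
# A "[pid 123] " prefix (from strace -f) is tolerated because search() is used.
OPENAT = re.compile(r'openat\([^,]+, "([^"]+)".*\) = (-?\d+)(?: (\w+))?')


def summarize(trace_lines):
    """Return (failed .so opens per directory, number of successful .so opens)."""
    failed = Counter()
    succeeded = 0
    for line in trace_lines:
        match = OPENAT.search(line)
        if match is None or ".so" not in match.group(1):
            continue  # not an openat line, or not a shared object
        path, ret, errno = match.groups()
        if errno == "ENOENT":
            failed[os.path.dirname(path)] += 1
        elif int(ret) >= 0:
            succeeded += 1
    return failed, succeeded
```

Passing `sys.stdin` to `summarize` while piping in the output of `strace -f -e openat ... 2>&1 >/dev/null` would give a per-directory breakdown; `failed.most_common(10)` then lists the directories that are probed in vain most often.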
The community has been circling a real fix for years. This post walks through the problem, the approaches people have tried with their tradeoffs, and a more radical one we spiked for devenv to see if it was even possible: deleting the dynamic loader altogether by linking the whole program into one static binary. The umbrella tracking issue for the general problem is NixOS/nixpkgs#481620.
Why Nix makes the loader work so hard
On a traditional distribution every shared library lives in a handful of global directories such as /usr/lib. The dynamic loader has a short, mostly cached search path, and ld.so.cache (built by ldconfig) turns soname lookups into a hash table hit. Nix is different by design. Every package lives in its own /nix/store/<hash>-name/lib directory, and there is no global ld.so.cache for store libraries. To make a binary find its dependencies, Nix records a DT_RUNPATH in the binary’s ELF dynamic section that lists one directory per dependency. A program linked against fifty libraries gets a DT_RUNPATH with dozens of entries.
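One way to look at these entries on a real binary is to parse the text printed by `readelf -d`. The sketch below is illustrative only: `parse_dynamic` is a hypothetical helper, and it assumes binutils' usual `(NEEDED)` and `(RUNPATH)` line layout.

```python
import re

# Matches lines such as:
#   0x0000000000000001 (NEEDED)   Shared library: [libz.so.1]
#   0x000000000000001d (RUNPATH)  Library runpath: [/nix/store/a/lib:/nix/store/b/lib]
ENTRY = re.compile(r"\((NEEDED|RUNPATH)\)\s+[^\[]*\[([^\]]*)\]")


def parse_dynamic(readelf_output):
    """Extract DT_NEEDED sonames and DT_RUNPATH directories from `readelf -d` text."""
    needed, runpath = [], []
    for line in readelf_output.splitlines():
        match = ENTRY.search(line)
        if match is None:
            continue
        tag, value = match.groups()
        if tag == "NEEDED":
            needed.append(value)
        else:
            # RUNPATH is a single colon-separated string; skip empty components
            runpath.extend(d for d in value.split(":") if d)
    return needed, runpath
```

The input would typically come from something like `subprocess.run(["readelf", "-d", path], capture_output=True, text=True).stdout`; comparing `len(needed)` with `len(runpath)` shows how wide the search path of a given object is.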
Now recall how glibc resolves a DT_NEEDED soname with DT_RUNPATH present: it walks every DT_RUNPATH directory in order, trying to open dir/soname in each, until one succeeds. So resolving N libraries against a path of M directories costs on the order of N times M openat() attempts, almost all of which fail. That is the stat storm. It gets worse. For every directory it searches, glibc first probes the glibc-hwcaps subdirectories for your CPU (x86-64-v3, x86-64-v2, and so on), which adds roughly three more failing opens per directory on a modern machine.
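The N times M behaviour can be made concrete with a toy model. This is a deliberately naive sketch, not glibc's algorithm: `naive_failed_opens` is a made-up name, it handles a single object's DT_RUNPATH only, it charges the hwcaps probes on every directory visit (the default of 3 is the "roughly three" from the text), and it ignores any caching the real loader does, so treat its result as an upper bound.

```python
def naive_failed_opens(needed, runpath, contents, hwcaps_probes=3):
    """Count failing opens when resolving `needed` sonames against `runpath`.

    contents maps a directory to the set of sonames it really contains.
    Assumption: no glibc-hwcaps subdirectory exists, so each probe fails.
    """
    failures = 0
    for soname in needed:
        for directory in runpath:
            failures += hwcaps_probes      # hwcaps subdirectories are tried first
            if soname in contents.get(directory, ()):
                break                      # found: stop walking the path
            failures += 1                  # plain dir/soname open failed
    return failures


# Two libraries, three directories, each library in a different directory.
contents = {"/d1": set(), "/d2": {"liba.so"}, "/d3": {"libb.so"}}
print(naive_failed_opens(["liba.so", "libb.so"], ["/d1", "/d2", "/d3"], contents))
```

Even in this tiny case most of the work is failure: two successful opens are accompanied by many probes that return nothing, and the count grows with both the number of libraries and the position of each library's directory in the path.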
On a fast SSD with a warm cache none of this is noticeable. On a slow disk, a network filesystem, a cold cache, or a low power ARM board, it is the difference between snappy and sluggish, and it multiplies across every process a shell script spawns. Concretely, the two workloads we traced most closely:
| Workload | Loaded libraries | DT_RUNPATH dirs | Failing .so opens |
|---|---|---|---|
| devenv version | 83 | 12 (leaf binary) | ~486 |
| imagemagick magick --version | 91 | 35 | ~1225 |
The wider a binary’s own DT_RUNPATH and the deeper its transitive graph, the worse the storm.
What a good fix has to preserve
The reason this problem has stayed open so long is that the obvious fixes break things people rely on. Any serious solution is judged against a checklist:
- LD_LIBRARY_PATH override. NixOS injects the GPU driver by putting /run/opengl-driver/lib on LD_LIBRARY_PATH. If a fix stops that from winning, graphics break.
- LD_PRELOAD. Interposers and shims must still load first.
- The libGL / glvnd runtime swap. A program built against Mesa must be able to pick up the vendor driver at runtime.
- Two libraries with the same soname. This is the heart of the Nix model: different parts of one closure can legitimately depend on different builds of the same soname, and resolution must stay per object.
- dlopen. Plugins loaded at runtime are a related but separate problem.
- Cross compilation. A fix that has to run the target loader cannot cross compile cleanly.
- Disk and closure size. Whatever metadata you add ships in every NAR.
- Maintenance burden. A glibc or loader patch has to be rebased onto every new glibc release, and patching glibc rebuilds the world.
No approach so far ticks every box. The interesting part is how each one chooses which boxes to give up.
Approach 1: freeze the resolution with absolute paths
The simplest idea: rewrite every DT_NEEDED entry from a bare soname like libfoo.so.1 to the absolute store path of the library it resolves to. glibc has a “slash short circuit”: a DT_NEEDED containing a / is opened directly, skipping all search. No search means no storm, and not even the glibc-hwcaps probes happen.
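As a rough illustration of the idea, the rewrite can be planned by resolving each soname against the DT_RUNPATH once, at build time, and then handing the result to patchelf. This is a sketch under assumptions, not how shrinkwrap or nix-harden-needed are implemented: `freeze_needed` and `patchelf_args` are hypothetical helpers, only the binary's own DT_RUNPATH is consulted, and the only external interface relied on is patchelf's `--replace-needed` flag.

```python
import os


def freeze_needed(needed, runpath):
    """Map each soname to the first RUNPATH directory that contains it.

    Sonames that cannot be found are left out, so the loader would still
    search for them at runtime as before.
    """
    frozen = {}
    for soname in needed:
        if "/" in soname:
            frozen[soname] = soname        # already a path: opened directly
            continue
        for directory in runpath:
            candidate = os.path.join(directory, soname)
            if os.path.exists(candidate):
                frozen[soname] = candidate
                break
    return frozen


def patchelf_args(frozen, binary):
    """Build a patchelf command line that rewrites the resolved entries."""
    args = ["patchelf"]
    for soname, path in frozen.items():
        if soname != path:
            args += ["--replace-needed", soname, path]
    return args + [binary]
```

The search that the loader would repeat on every start is paid once here instead. The flip side is that an entry frozen this way is no longer looked up through LD_LIBRARY_PATH, which is exactly the kind of box from the checklist that this approach gives up.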
This is well trodden ground: Farid Zakaria’s shrinkwrap and the nix-harden-needed tool do exactly this as external post processing. Shrinkwrap is described in the paper Mapping Out the HPC Dependency Chaos (Zakaria, Scogland, Gamblin, Maltzahn, 2022; arXiv:2211.05118), which measures the storm directly: an Emacs launch drops from 1823 stat/openat syscalls to 104, a 36 times speedup, and a 900 library MPI application starting across 2048 processes on NFS goes from 344.6s to 47.8s, 7.2 times faster. Those NFS numbers are the clearest evidence that this overhead, invisible on a warm local cache, becomes brutal on a network or co…