The postmodern build system
This is a post about an idea. I don’t think it exists, and I am unsure if I will be the one to build it. However, something like it should be made to exist.
What we want
- Trustworthy incremental builds: move the incrementalism to a distrusting layer such that the existence of incremental-build bugs requires hash collisions. Such a goal implies sandboxing the jobs and making them into pure functions, which are rerun when the inputs change at all. This is inherently wasteful of computation because it ignores semantic equivalence of results in favour of only binary equivalence, and we want to reduce the wasted computation. However, for production use cases, computation is cheaper than the possibility of incremental bugs.
  This can be equivalently phrased in terms of lacking “build identity”: is there any way that the system knows what the “previous version” of the “same build” is? A postmodern build system doesn’t have build identity, because it causes problems for multitenancy among other things: who decides what the previous build is?
- Maximize reuse of computation across builds. Changing one source file should rebuild as little as absolutely necessary.
- Distributed builds: We live in a world where software can be compiled much faster by using multiple machines. Fortunately, turning the build into pure computations almost inherently allows distributing it.
Review Build systems à la carte
This post uses language from “Build systems à la carte”:
Monadic: a build that needs to run builds to know the full set of targets. It’s so called because of the definition of the central operation on monads: bind :: Monad m => m a -> (a -> m b) -> m b. This means that, given a not-yet-executed action returning a and a function taking the resolved result of that action, you get a new action whose shape can depend arbitrarily on the result of m a. This is a dynamic build plan, since full knowledge of the build plan requires executing m a.
Applicative: a build for which the plan is statically known. Generally this implies a strictly two-phase build where the targets are evaluated, a build plan is made, and then the build is executed. This is so named because of the central operation on applicative types: apply :: Applicative f => f (a -> b) -> f a -> f b. This means that, given a predefined pure function inside a build, the function can be executed to perform the build. However, the shape of the build plan is known ahead of time, since the function cannot execute other builds.
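To make the monadic/applicative distinction concrete, here is a minimal Python sketch. It is not code from “Build systems à la carte”; the names StaticTask, plan_static and build_dynamic are made up for illustration. In the applicative style, dependencies are plain data, so the plan can be computed without running anything. In the monadic style, a task asks for dependencies while it runs, so the set of dependencies is only known after execution.

```python
from typing import Callable, Dict, List, Tuple

Fetch = Callable[[str], str]


class StaticTask:
    """Applicative-style task (hypothetical): dependencies are declared up front."""

    def __init__(self, deps: List[str], run: Callable[[List[str]], str]):
        self.deps = deps
        self.run = run


# Monadic-style task (hypothetical): a function that fetches dependencies while
# running, so later fetches may depend on the values of earlier ones.
DynamicTask = Callable[[Fetch], str]


def plan_static(tasks: Dict[str, StaticTask], target: str) -> List[str]:
    """Compute a dependency-ordered plan without executing any task."""
    order: List[str] = []

    def visit(key: str) -> None:
        # Keys without a task are source inputs and need no build step.
        if key in order or key not in tasks:
            return
        for dep in tasks[key].deps:
            visit(dep)
        order.append(key)

    visit(target)
    return order


def build_dynamic(
    tasks: Dict[str, DynamicTask], inputs: Dict[str, str], target: str
) -> Tuple[str, List[str]]:
    """Run the build; which keys were needed is only known afterwards."""
    fetched: List[str] = []

    def fetch(key: str) -> str:
        fetched.append(key)
        if key in inputs:
            return inputs[key]
        return tasks[key](fetch)

    return fetch(target), fetched


static_tasks = {
    "app": StaticTask(["main.o", "util.o"], lambda vs: "+".join(vs)),
    "main.o": StaticTask(["main.c"], lambda vs: f"obj({vs[0]})"),
    "util.o": StaticTask(["util.c"], lambda vs: f"obj({vs[0]})"),
}
static_plan = plan_static(static_tasks, "app")

# The "app" task reads "config" first, then fetches whichever key it names.
dynamic_tasks = {"app": lambda fetch: "linked:" + fetch(fetch("config"))}
dynamic_inputs = {"config": "backend_a", "backend_a": "A", "backend_b": "B"}
dynamic_result, dynamic_fetched = build_dynamic(dynamic_tasks, dynamic_inputs, "app")
```

Note that plan_static never calls a task's run function, whereas there is no way to learn that the dynamic "app" task needs "backend_a" rather than "backend_b" without executing it.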
Nix
As much of a Nix shill as I am, Nix is not the postmodern build system. It has some design flaws that are very hard to rectify. Let’s write about the things it does well, that are useful to adopt as concepts elsewhere. Nix is a build system based on the idea of a “derivation”. A derivation is simply a specification of an execution of execve. Its output is then stored in the Nix store (/nix/store/*) based on a name determined by hashing inputs or outputs.
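As a rough illustration of “a derivation is a specification of an execution of execve”, the sketch below models a derivation as plain data (program, arguments, environment) and runs it with nothing inherited from the parent environment. This is not how Nix is implemented: the Derivation class, the realise function, and the JSON-based digest are all assumptions made for the example. There is no real sandboxing here, and a POSIX system with /bin/sh is assumed.

```python
import hashlib
import json
import os
import subprocess
import tempfile
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Derivation:
    """Hypothetical stand-in for a derivation: everything execve needs."""

    builder: str
    args: List[str]
    env: Dict[str, str] = field(default_factory=dict)

    def digest(self) -> str:
        # Assumption: canonical JSON is a good enough serialization for a sketch.
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()


def realise(drv: Derivation, out_path: str) -> None:
    """Run the builder with only the declared environment plus `out`."""
    env = dict(drv.env, out=out_path)
    with tempfile.TemporaryDirectory() as scratch:
        # env= replaces the environment entirely; nothing leaks from the parent.
        subprocess.run([drv.builder, *drv.args], env=env, cwd=scratch, check=True)


# A variable set in the parent process must not be visible to the builder.
os.environ["SECRET_FROM_PARENT"] = "leaked"
script = 'printf "%s" "${SECRET_FROM_PARENT:-unset}" > "$out"'
drv = Derivation(builder="/bin/sh", args=["-c", script])

store = tempfile.mkdtemp()
out = os.path.join(store, drv.digest()[:32] + "-demo")
realise(drv, out)
with open(out) as f:
    content = f.read()
```

Because the specification is plain data, two identical specifications always hash to the same digest, which is what makes the result addressable by hash in the first place.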
Memoization is achieved by skipping builds for which the output path already exists. This mechanism lacks build identity, and is multitenant: you can dump a whole bunch of different Nix projects of various versions on the same build machine and they will not interfere with each other because of the lack of build identity; the only thing to go off of is the hash.
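The memoization rule described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the store is an ordinary temporary directory, build_memoized is a hypothetical helper, and the hash is a plain SHA-256 of the input bytes rather than anything Nix actually computes.

```python
import hashlib
import os
import tempfile
from typing import Callable, Tuple


def build_memoized(
    store: str, name: str, inputs: bytes, builder: Callable[[bytes], bytes]
) -> Tuple[str, bool]:
    """Return (output path, whether the builder actually ran)."""
    digest = hashlib.sha256(inputs).hexdigest()[:32]
    path = os.path.join(store, f"{digest}-{name}")
    if os.path.exists(path):
        # The path exists, so the build is skipped. No notion of a
        # "previous build" is consulted, only the hash.
        return path, False
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(builder(inputs))
    os.replace(tmp, path)  # publish the output only once it is complete
    return path, True


def compile_source(src: bytes) -> bytes:
    """Stand-in for a real compiler."""
    return src.upper()


store = tempfile.mkdtemp()
# Two versions of the "same" project share one store without interfering.
path_v1, built_v1 = build_memoized(store, "hello", b"v1 source", compile_source)
path_v2, built_v2 = build_memoized(store, "hello", b"v2 source", compile_source)
# Going back to v1 is a cache hit, regardless of what was built in between.
path_again, built_again = build_memoized(store, "hello", b"v1 source", compile_source)
```

Nothing in this sketch records which build came last, which is the sense in which the mechanism has no build identity.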
The store path is either:
- Named based on the hash of the contents of the derivation: input-addressed. This is the case for building software, typically.
- Named based on the hash of the output: fixed-output. This is the case for downloading things, and in practice has a relaxed sandbox allowing network access. However, the output is then hashed and verified against a hardcoded value.
- Named based on the output hash of the derivation, which is not fixed: content-addressed. Note that ca-derivations have had a rocky deployment timeline and have been removed from Lix. See ca-derivations.
(The JSON block for GNU hello derivation is omitted for brevity)
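The three naming schemes listed above differ only in what gets hashed. The sketch below shows that difference in Python. The function names and the hashing are assumptions for illustration; real Nix computes its store path hashes differently, so none of these paths would match an actual store.

```python
import hashlib


def _short_hash(*parts: bytes) -> str:
    """Hypothetical hash helper: length-prefixed SHA-256, truncated."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()[:32]


def input_addressed(drv_spec: bytes, name: str) -> str:
    # The name depends on the build recipe, not on what the build produces.
    return f"/nix/store/{_short_hash(b'input', drv_spec)}-{name}"


def fixed_output(expected_sha256: str, actual_output: bytes, name: str) -> str:
    # The output is verified against a hash written down ahead of time.
    actual = hashlib.sha256(actual_output).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"hash mismatch: expected {expected_sha256}, got {actual}")
    return f"/nix/store/{_short_hash(b'fixed', expected_sha256.encode())}-{name}"


def content_addressed(actual_output: bytes, name: str) -> str:
    # The name is only known after the build, from whatever it produced.
    return f"/nix/store/{_short_hash(b'ca', actual_output)}-{name}"


output = b"identical build result"
# Two different recipes that happen to produce identical bytes:
ia_a = input_addressed(b"recipe with a comment changed", "hello")
ia_b = input_addressed(b"recipe", "hello")
ca_a = content_addressed(output, "hello")
ca_b = content_addressed(output, "hello")
fo = fixed_output(hashlib.sha256(output).hexdigest(), output, "hello.tar.gz")
```

Here ia_a and ia_b differ even though the outputs are byte-identical, while ca_a and ca_b coincide. That is the wasted recomputation that content addressing is meant to recover.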
execve memoization and the purification of execve
Central to the implementation of Nix is the idea of making execve pure. This is a brilliant idea that allows it to be used with existing software, and probably would be necessary at some level in a postmodern build system. The way that Nix purifies execve is through the idea of “unknown i…