Building the deployment tool I wish I had

It is 00:43 at night. I look at the plan and press y. “s4.ruuda.nl: connecting …” I hold my breath. “Applying” briefly flashes in my terminal window before settling on “done.” It worked. Now the real work can start. Pop stack frame. What was I doing again?

People who have worked with me for a while might accuse me of suffering from not-invented-here syndrome. I prefer to call it “having higher standards.” Why subject yourself to the endless frustration that mediocre tools inflict on you, when you can build your own tools that are actually nice to use? Anyway, what was I doing again?

Right, I wanted to write a blog post about European digital sovereignty. It would be hypocritical to publish that on a blog hosted in the US, with a US-controlled hyperscaler. So let’s move that to Europe, easy enough. Push stack frame. Spawn a new VM, point the DNS records at … oh, right, DNS. I use Cloudflare for that. Another entity that Trump can order to stop providing services when shit hits the fan. Maybe I should self-host my DNS servers then. Push stack frame.

My webserver is a tiny VM that runs Nginx, plus Lego to renew certificates. The Nginx configuration grew somewhat complicated over the years, but I generate it with Nix so it’s fine, just two configuration files. I wrote a small Python script that copies the files to the server and restarts Nginx. The script served me well for the past years, but now I want to start running DNS servers. I need at least two servers now. And more systemd units, configuration files, zonefiles … The script is not going to cut it any more, I need serious cluster configuration management. Push stack frame.

“NixOS” I hear Arian’s voice whisper in my head. “Just use NixOS. It’s only one line to configure, services.nsd.enable = true.” He’s right of course. I already use Nix to build minimal EROFS images for Nginx and Lego. That’s how I run them on Flatcar. But I like the idea of a minimal base OS, and running my services from readonly chroots with no more binaries than needed. “Let’s not scope-creep this into switching distros right now,” I tell myself. “Let’s build a new deployment tool instead.”

How it looks

It is now one month later, and Deptool exists. This is me updating my DNS records:

$ deptool deploy
s4.ruuda.nl update nsd ~ zones/ruuda.nl.zone restart unit nsd.service
s5.ruuda.nl update nsd ~ zones/ruuda.nl.zone restart unit nsd.service
Auto-rollback if deploy fails. Apply to 2 hosts in cluster 'prod'? [y/N/d] y
s4.ruuda.nl: done
s5.ruuda.nl: done
Changes deployed successfully to 2 hosts in 0.99s.

In this post we’ll walk through how it works, but let’s not run ahead. How did I get here?

Wishlist

If I’m going to build my own tool … what would an actually nice configuration management tool look like? Here we can look to Ansible for guidance: it made all the mistakes so that others can learn from them. I want my tool to be:

  • Fast. A configuration update should be sub-second. There’s no fundamental reason for it to be slower than that; even a transatlantic ping is only 100ms.
  • Predictable. The tool should show me what it’s going to do, and then do just that. Like OpenTofu, with a separate plan and apply phase. Not like Ansible, where check mode is useless because every imperative step can trigger a cascade of changes that are only known after executing the step. And where nothing prevents the host from changing between the check and the real run, making the check more of a vibe check than something you can depend on.
  • Safe. If I break my Nginx configuration, I don’t want my webserver to be down for minutes while I frantically try to fix it. (Me: “Ah, only a small typo, faster to just fix it than to try and restore the previous version.” Narrator: “If only that typo were the only problem …”) No, I want the tool to automatically roll back for me. In milliseconds.
  • Simple. I just need to copy configuration files from my laptop to my servers and restart a few systemd units. I don’t need to solve every deployment problem for everybody; I don’t need control flow or arbitrary code execution. I do need to be able to template (excuse me, generate) configuration files, but a separate tool can do that.
  • Declarative. If I remove a file or application from my config, it should be removed from the server. I don’t want to have to add explicit cleanup steps, and end up with drift and lingering files when I inevitably forget that.
  • Zero-setup. I want to use this tool to manage my servers right after provisioning them. I don’t want to manually install agents, daemons, or dependencies, and I don’t want to have to enroll or register the host anywhere, because then I’d have the new problem of automating that.
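To make the “predictable” and “declarative” items concrete, here is a minimal sketch of a plan phase in Python. It is not Deptool’s actual code: the `Change` type, the `plan` function, and the representation of a host’s files as a dict from path to content are all assumptions made for illustration. The point is only that the plan is a pure comparison, and that files missing from the desired state show up as removals without any explicit cleanup step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    op: str    # "+" create, "~" update, "-" remove (hypothetical notation)
    path: str


def plan(desired: dict[str, bytes], current: dict[str, bytes]) -> list[Change]:
    """Compare the desired files against what a host has; touch nothing."""
    changes = []
    for path in sorted(desired.keys() | current.keys()):
        if path not in current:
            changes.append(Change("+", path))
        elif path not in desired:
            # Declarative: anything no longer in the config gets removed.
            changes.append(Change("-", path))
        elif desired[path] != current[path]:
            changes.append(Change("~", path))
    return changes


if __name__ == "__main__":
    # Placeholder contents; only the zone file differs between the two sides.
    desired = {"nsd.conf": b"same", "zones/ruuda.nl.zone": b"serial 2"}
    current = {"nsd.conf": b"same", "zones/ruuda.nl.zone": b"serial 1"}
    for change in plan(desired, current):
        print(change.op, change.path)
```

An apply phase would then execute exactly this list and nothing else, which is what separates it from a check mode that can only guess.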

Decouple distribution

The core idea feels obvious in hindsight, which is maybe why it keeps appearing everywhere. At work, David had recently built Unsible, a tool that takes Ansible playbooks, but instead of executing them step by step, it locally builds a tarball and ships it to the host, where it mostly just needs to put the files in place. I like this idea of decoupling configuration generation from distribution. It’s how my barebones deploy script worked too: build the configuration externally, and the script is mostly a dumb file copier. In a sense, NixOS is this idea applied to the local system. We can learn more from Nix: store generated artifacts in a place where different versions can coexist, and limit the imperative parts of system administration to a small activation step that swaps a few symlinks. This design works well for both package management and system configuration.
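The “swap a few symlinks” step is small enough to sketch. The following Python is an illustration of the general technique, not how Nix or Deptool implement it: the `activate` function, the `current` link name, and the layout of one directory per version inside a store are assumptions. It creates a new symlink under a temporary name and renames it over the old one, which on POSIX systems replaces the link atomically.

```python
import os
from pathlib import Path


def activate(store: Path, version: str, link_name: str = "current") -> None:
    """Atomically repoint store/<link_name> at store/<version>."""
    target = store / version
    if not target.is_dir():
        raise FileNotFoundError(f"no such version: {target}")
    link = store / link_name
    tmp = store / f".{link_name}.tmp"
    # Clear a leftover temporary link from an interrupted earlier run.
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(version, tmp)  # relative link, so the store can be moved
    os.replace(tmp, link)     # rename(2) swaps the link in one step
```

Because old versions stay in the store, rolling back is the same operation with the previous version name: `activate(store, "v1")` after a failed `activate(store, "v2")`.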

Hello Deptool

With the scope reduced to distributing small files and running a simple activation step, deployment becomes a tractable problem to solve. I call my take on it Deptool. Here’s how it works.

Pre-render config files for the entire cluster. Store them in a directory on disk. This directory tree is two levels deep: a directory per target host at the top level, a directory per application below that.
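As a rough picture of that layout, here is a sketch that writes such a tree from an in-memory description. The `render_tree` function and the nested-dict input are hypothetical; in practice a separate generator produces the files. Only the host/application nesting is taken from the description above, and the zone file content is a placeholder.

```python
from pathlib import Path


def render_tree(
    root: Path, cluster: dict[str, dict[str, dict[str, str]]]
) -> list[Path]:
    """Write files to root/<host>/<app>/<path>; return paths relative to root."""
    written = []
    for host, apps in sorted(cluster.items()):
        for app, files in sorted(apps.items()):
            for rel_path, content in sorted(files.items()):
                dest = root / host / app / rel_path
                dest.parent.mkdir(parents=True, exist_ok=True)
                dest.write_text(content)
                written.append(dest.relative_to(root))
    return written


if __name__ == "__main__":
    zone = "; placeholder zone contents\n"
    cluster = {
        "s4.ruuda.nl": {"nsd": {"zones/ruuda.nl.zone": zone}},
        "s5.ruuda.nl": {"nsd": {"zones/ruuda.nl.zone": zone}},
    }
    for path in render_tree(Path("rendered"), cluster):
        print(path)
```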

Put that in a Git