Recovering a status page from a half-finished schema migration

The log line was ‘database schema version 21 found, expected 23’ and the pod exited immediately after printing it. The team had already tried a Helm rollback to the previous chart version. That pod printed the inverse: ‘database schema version 23 found, expected 21’, from a binary that wanted 21. Both versions refused to start against the same database, and the company’s only public status page had been down for 38 minutes.

The migration job had been OOMKilled mid-run during the upgrade, and Postgres was now in a state neither binary recognized.

Problem signals:

  • Application logs ‘schema version N found, expected M’ and exits before serving traffic
  • Helm rollback to the previous chart version fails with the same or inverse schema error
  • A migration job pod shows exit code 137 or OOMKilled in kubectl describe
  • The schema_migrations (or equivalent) table reports a version the table DDL does not actually match
  • Restoring from the most recent Postgres backup would lose hours of production data the team needs to keep
  • Both the old and new binaries refuse to start against the same database
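
Exit code 137 is worth decoding rather than memorizing. By the usual convention, a code above 128 means the process died from signal number (code - 128), and 137 - 128 = 9, which is SIGKILL. The sketch below is a small illustrative helper (the function name is ours, not part of any tool). Note that SIGKILL on its own does not prove an OOM kill; the OOMKilled reason in kubectl describe is what confirms it.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        signum = code - 128
        try:
            # Map the signal number to its name on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            # The platform does not define this signal number.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with application error code {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```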

The log line that ruled out a rollback

The on-call had done the obvious thing first. The new chart was failing, so they ran helm rollback to the previous revision. The previous revision’s pod came up, hit the database, and crashed with the mirror image of the error. New code wanted schema 23 and saw 21. Old code wanted 21 and saw 23. Each binary was right to refuse. That symmetry is what told us the database was the problem, not the chart. A clean rollback should have produced a running pod. If both binaries reject the same database, the database is not in either of the states they expect. It is in a third state nobody coded for.

$ kubectl logs -n statuspage statuspage-app-7b9f-xq2vk
INFO starting statuspage v0.90.78
INFO connecting to postgres at postgres.statuspage.svc:5432
ERROR database schema version 23 found, expected 21
FATAL refusing to start with schema version mismatch

$ kubectl rollout history deployment/statuspage-app -n statuspage
REVISION  CHANGE-CAUSE
14        helm upgrade statuspage statuspage/statuspage --version 0.91.2
15        helm rollback statuspage 14

$ kubectl logs -n statuspage statuspage-app-6c4d-7m9pz # rolled-back pod
ERROR database schema version 23 found, expected 21
FATAL refusing to start with schema version mismatch

Both chart revisions failed the same schema version check. The database, not the chart, was the problem. The version row claimed 23. The table DDL was somewhere short of 23.
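
The reasoning we applied can be written down as a tiny decision rule. This is only a sketch of the triage logic described above, with made-up function and parameter names; it is not something we ran during the incident.

```python
def diagnose(new_binary_started: bool, old_binary_started: bool) -> str:
    """Triage a failed upgrade from which binaries can start against the database."""
    if new_binary_started:
        return "upgrade succeeded; nothing to recover"
    if old_binary_started:
        # A clean rollback produces a running pod, so the database is still
        # in the state the old binary expects.
        return "chart or new binary at fault; rollback is sufficient"
    # Neither binary accepts the database: it matches neither expected state.
    return "database is in a state neither binary expects; inspect the schema directly"


if __name__ == "__main__":
    # Our situation: the upgraded pod and the rolled-back pod both refused to start.
    print(diagnose(new_binary_started=False, old_binary_started=False))
```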

What the schema_migrations table actually said

We dropped into psql against the application database and pulled the migration tracking table. The row said version 23, applied. The dirty flag was true, which in most migration libraries means ‘a migration started running and never reported success’. That single boolean was the thread we pulled on for the next hour.

statuspage=> select * from schema_migrations;
 version | dirty 
---------+-------
      23 | t
(1 row)

statuspage=> \d incidents
-- expected per migration 0023: severity column, incident_updates FK, partial index on resolved_at IS NULL
-- present: none of the above

The version row was lying. The table structure was not at 23. We pulled the migration files out of the chart’s image and read them. Migration 0023 was three statements: add a severity column, create an incident_updates table with a foreign key back to incidents, create a partial index on unresolved incidents. None of the three was present in the live schema.
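
The comparison we did by eye can be sketched as a set difference between what a migration should create and what the catalog reports. Everything below is illustrative: the index name and the existing column names are hypothetical, and the catalog queries are examples of what one might run in psql to fill in the "present" side, not the exact commands from the incident.

```python
# Example catalog queries for collecting what is actually present (run separately in psql).
COLUMN_QUERY = "SELECT column_name FROM information_schema.columns WHERE table_name = 'incidents'"
TABLE_QUERY = "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'"
INDEX_QUERY = "SELECT indexname FROM pg_indexes WHERE tablename = 'incidents'"

# What migration 0023 should have created. The index name is a hypothetical placeholder.
EXPECTED_0023 = {
    "column": {"severity"},
    "table": {"incident_updates"},
    "index": {"incidents_unresolved_idx"},
}


def migration_state(expected, present):
    """Classify a migration as applied, partial, or not applied, and list what is missing."""
    missing = {kind: names - present.get(kind, set()) for kind, names in expected.items()}
    total = sum(len(names) for names in expected.values())
    absent = sum(len(names) for names in missing.values())
    if absent == 0:
        state = "applied"
    elif absent == total:
        state = "not applied"
    else:
        state = "partial"
    return state, missing


if __name__ == "__main__":
    # Hypothetical catalog contents resembling what we found: none of 0023's objects exist.
    present = {"column": {"id", "title"}, "table": {"incidents"}, "index": set()}
    state, missing = migration_state(EXPECTED_0023, present)
    print(state, missing)
```

The useful property is that the check never consults schema_migrations. It answers "what is there" independently of "what the bookkeeping claims", which is the distinction that mattered here.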

The OOM kill had hit after the migration library wrote the version row and before any DDL statement actually committed. Or possibly between statements. The order was implementation-specific, and we did not care which exact moment it was because the answer was the same: none of 0023’s DDL had landed, but the bookkeeping said it had.

This is the specific failure mode that makes partial migrations dangerous. The migration library and the actual schema disagree, and the application trusts the library. The library trusts a row it wrote in a different transaction than the DDL it was supposed to be tracking. Most migration tools have fixed this since 2020 by wrapping version-row-and-DDL in a single transaction, but a long tail of applications still ship with the older split-transaction behaviour, and you only find out which one you have when something interrupts a migration.
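
The difference between the two behaviours is easy to reproduce in miniature. The sketch below uses an in-memory SQLite database as a stand-in for Postgres (both allow DDL inside a transaction), assumes the previous version was 22, and simulates the kill with an exception. It is a toy model of the failure mode, not the code of any particular migration library.

```python
import sqlite3


class Interrupted(Exception):
    """Stands in for the migration process being killed mid-run."""


def fresh_db():
    # isolation_level=None puts sqlite3 in autocommit mode: every statement
    # commits on its own unless we issue an explicit BEGIN.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE schema_migrations (version INTEGER, dirty INTEGER)")
    conn.execute("INSERT INTO schema_migrations VALUES (22, 0)")  # assumed prior version
    conn.execute("CREATE TABLE incidents (id INTEGER PRIMARY KEY, title TEXT)")
    return conn


def migrate(conn, single_transaction, interrupt):
    """Apply a one-statement 'migration 23', optionally dying after the bookkeeping write."""
    if single_transaction:
        conn.execute("BEGIN")
    try:
        conn.execute("UPDATE schema_migrations SET version = 23, dirty = 1")
        if interrupt:
            raise Interrupted
        conn.execute("ALTER TABLE incidents ADD COLUMN severity TEXT")
        conn.execute("UPDATE schema_migrations SET dirty = 0")
        if single_transaction:
            conn.execute("COMMIT")
    except Interrupted:
        if single_transaction:
            # A killed client never commits, so the database discards the open
            # transaction; the explicit ROLLBACK simulates that here.
            conn.execute("ROLLBACK")
        # In split mode the bookkeeping write has already committed on its own.


def snapshot(conn):
    """Return (recorded version, dirty flag, whether the severity column exists)."""
    version, dirty = conn.execute("SELECT version, dirty FROM schema_migrations").fetchone()
    columns = [row[1] for row in conn.execute("PRAGMA table_info(incidents)")]
    return version, bool(dirty), "severity" in columns


if __name__ == "__main__":
    for single in (False, True):
        conn = fresh_db()
        migrate(conn, single_transaction=single, interrupt=True)
        print("single transaction" if single else "split transactions", snapshot(conn))
```

In the split case the interrupted run leaves the bookkeeping at 23 and dirty with no severity column, which is the shape of what we found. In the single-transaction case the interrupted run leaves the database exactly as it was.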

Why we did not restore from backup

The base backup was 6 hours stale and uptime data is the product. The instinct, and the safe move on most days, is to restore Postgres from the last known-good base backup and replay WAL up to a point just before the migration started. We checked the backup. It was a nightly pg_basebackup, 6 hours old, and WAL archiving had been configured but never tested for PITR. We could probably have done it. We were not willing to bet the status page on ‘probably’ while the status page was already down.

More importantly, the uptime check history is the product. A status page that loses 6 hours of check data after an outage is worse than a status page that takes another hour to come back. We talked it through with the team lead and decided the database in front of us was recoverable, and recovering it was lower risk than the restore path. That decision is worth naming because it goes against the usual ‘just restore from backup’ instinct. When the data itself is the value, finishing a half-migration by hand is often the right call.