I made the database compute everything: building an SLA-credit system of record on Aurora PostgreSQL + Vercel
I built this on nights and weekends to learn a stack I keep making decisions about but rarely touch with my own hands. Sharing what I learned in case it’s useful to anyone doing the same.
I’m a principal product manager, and I’ve spent more nights than I’d like on incident bridges, watching a service degrade in real time and then writing the document that goes to the customer afterward: here’s what broke, here’s how long, here’s what we’re doing so it doesn’t happen again. Owning that accountability up close teaches you something the dashboards don’t: the hardest question isn’t “what happened” but “what do we owe, and can we prove it?”
Building integrations is my actual job, which means I spend a lot of time thinking about the teams on the other side of an outage: the people who have to turn an incident into a number a customer will accept. And the same pains kept showing up. Getting the data is a scavenger hunt across systems that were never built to agree. Asking “what if we’d classified this differently” means re-running everything by hand, so nobody does until they’re forced to.
When a customer disputes the number, “I think so” is the most honest answer anyone can give, on the one topic where you can least afford it: money owed. And when a contract or a severity turns out to have been wrong, correcting a credit you already settled is a mess nobody wants to touch. None of those are calculation problems. They’re proof problems. So on nights and weekends, I built a system of record designed to solve them.
The result is Attest, a system of record for the SLA credits a B2B company owes its customers. Not a calculator, not a monitoring dashboard. The thing that can answer, with a receipt, “how much do we owe this customer, and can you prove it?” Here’s what I learned making PostgreSQL on Amazon Aurora the actual product, with Vercel as a thin layer in front of it.
The problem, briefly
When a service misses its SLA, the customer is owed a credit. Sounds like arithmetic. It isn’t. The number depends on: how long the outage really lasted, which minutes the contract excludes (scheduled maintenance), how severe the incident was classified, which contract version was in force at the time, where the month’s total downtime lands against a tiered schedule, and the customer’s monthly charge.
Those inputs live in five different systems that were never built to talk to each other. The number is one value; assembling it by hand takes days, and when a customer disputes it, nobody can prove it in the room. The interesting realization: the hard part isn’t the math. It’s the proof. And proof is an architecture decision.
The core decision: the app computes nothing
Most apps treat the database as a place to store rows and do the real work in application code. I inverted that. In Attest, the Next.js layer on Vercel passes parameters, renders rows, and does no credit math at all: no tier lookups, no severity weighting, no downtime subtraction. A credit lookup is literally:

SELECT * FROM compute_credit($1, $2::date);
The route returns that row as-is. Everything that produces the number lives in the database. Why bother? Because “the database computed it” is verifiable in a way “the app computed it” never is. If the math lives in TypeScript scattered across handlers, proving a number is correct means auditing code paths. If it lives in one SQL function, the derivation is a single, inspectable source of truth. For a product whose entire value is defensibility, that’s not a purity exercise; it’s the feature.
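To make the shape of that idea concrete, here is a minimal sketch of a function that returns the whole derivation as one row. It is not Attest’s actual compute_credit: the name compute_credit_sketch, both tables, and every column are assumptions made for illustration, and a real version would derive the credited minutes from incidents and maintenance windows rather than read a stored value.

```sql
-- Hypothetical, heavily simplified stand-in tables (not Attest's schema).
CREATE TABLE customer_month (
    customer_id      bigint  NOT NULL,
    month_start      date    NOT NULL,  -- first day of the billing month
    monthly_charge   numeric NOT NULL,
    credited_minutes numeric NOT NULL,  -- assumed precomputed for this sketch
    PRIMARY KEY (customer_id, month_start)
);

CREATE TABLE credit_tier (
    uptime_band numrange NOT NULL,      -- uptime percentage band, e.g. [99.0,99.9)
    credit_pct  numeric  NOT NULL       -- fraction of the monthly charge credited
);

-- Returns the inputs and the result together, so the caller can show its work.
CREATE FUNCTION compute_credit_sketch(p_customer_id bigint, p_month date)
RETURNS TABLE (
    credited_minutes numeric,
    uptime_pct       numeric,
    credit_pct       numeric,
    credit_amount    numeric
)
LANGUAGE sql
STABLE
AS $$
    WITH month_facts AS (
        SELECT cm.monthly_charge,
               cm.credited_minutes,
               -- days in the calendar month containing p_month, times 1440 minutes
               ((date_trunc('month', p_month) + interval '1 month')::date
                 - date_trunc('month', p_month)::date) * 1440 AS month_minutes
        FROM customer_month cm
        WHERE cm.customer_id = p_customer_id
          AND cm.month_start = date_trunc('month', p_month)::date
    ),
    with_uptime AS (
        SELECT mf.monthly_charge,
               mf.credited_minutes,
               round(100 * (1 - mf.credited_minutes / mf.month_minutes), 4) AS uptime_pct
        FROM month_facts mf
    )
    SELECT u.credited_minutes,
           u.uptime_pct,
           t.credit_pct,
           u.monthly_charge * t.credit_pct
    FROM with_uptime u
    JOIN credit_tier t ON t.uptime_band @> u.uptime_pct;
$$;

-- The application layer would only pass parameters and render the row.
SELECT * FROM compute_credit_sketch(42, '2024-03-15'::date);
```

Because the function returns the intermediate values alongside the amount, anyone with read access can rerun the same call and inspect each step, which is the property the pass-through route depends on.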
Where PostgreSQL earns its keep: range types
The piece I’d never used before and now love: range and multirange types. The “credited downtime” for an incident is the outage minus the maintenance windows the contract excludes. That’s set subtraction over time intervals — exactly what tstzmultirange is for.
The heart of the whole system is one expression:

tstzmultirange(ii.impact_window)
  - COALESCE(
      range_agg(mw.maint_window * ii.impact_window),
      '{}'::tstzmultirange
    )
Read it left to right: take the incident’s impact window as a multirange, subtract the union of every maintenance window clipped to that incident (* is range intersection, range_agg folds the windows together, - is multirange difference). What comes back is the non-contiguous set of minutes that actually count — maybe two separate segments with a 14-minute hole carved out of the middle.
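The expression can be tried in isolation with literal values. The sketch below assumes PostgreSQL 14 or later, where multirange types were introduced, and uses made-up timestamps; the CTE names and the join on && are illustrative and not taken from Attest’s schema.

```sql
-- Illustrative values: a 60-minute impact window and two maintenance windows.
WITH incident_impact(impact_window) AS (
    VALUES (tstzrange('2024-03-10 02:00+00', '2024-03-10 03:00+00'))
),
maintenance(maint_window) AS (
    VALUES
        -- Inside the impact window: carves a 14-minute hole out of it.
        (tstzrange('2024-03-10 02:20+00', '2024-03-10 02:34+00')),
        -- A day later: does not overlap, so the && join condition drops it.
        (tstzrange('2024-03-11 01:00+00', '2024-03-11 02:00+00'))
),
credited AS (
    SELECT tstzmultirange(ii.impact_window)
             - COALESCE(
                   range_agg(mw.maint_window * ii.impact_window),
                   '{}'::tstzmultirange
               ) AS credited_window
    FROM incident_impact ii
    LEFT JOIN maintenance mw
           ON mw.maint_window && ii.impact_window
    GROUP BY ii.impact_window
)
SELECT c.credited_window,
       -- Add up the length of each remaining segment.
       (SELECT sum(upper(seg) - lower(seg))
        FROM unnest(c.credited_window) AS seg) AS credited_duration
FROM credited c;
```

With these values the credited multirange has two segments, 02:00 to 02:20 and 02:34 to 03:00 UTC, and credited_duration comes to 46 minutes. When no maintenance window overlaps the incident, range_agg has nothing to aggregate and returns NULL, which is why the expression falls back to an empty multirange with COALESCE.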
No loops, no manual interval-merging in app code, no off-by-one bugs reconstructing intervals by hand. The database models time intervals as first-class values, and the subtraction is one operator. This matters more than it looks, because that credited-minutes number feeds everything downstream: it sets the month’s total downtime, which sets the uptime percentage, which determines the tier, which sets the dollar amount.
A few minutes of error in the subtraction can move the total across a tier boundary and change the credit by thousands. Getting that exactly right, and having it reconcile end to end, was the part that took the most iteration. The payoff is the most dramatic moment in the product: toggle whether one 14-minute maintenance window is excluded, and a credit steps from $2,400 to $6,000, a $3,600 swing, because those minutes push the month across the 99.0% line into a worse tier.
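The tier jump is easier to see with numbers. The sketch below is a worked example under assumptions that do not come from Attest: a 30-day month (43,200 minutes), a three-band schedule, a $24,000 monthly charge, and 425 minutes of other downtime already accrued. They were chosen only so that the result reproduces the $2,400 and $6,000 figures.

```sql
-- All figures below are hypothetical and chosen for illustration.
WITH tier(uptime_band, credit_pct) AS (
    VALUES
        (numrange(99.9, 100.0, '[]'), 0.00),  -- SLA met: no credit
        (numrange(99.0, 99.9,  '[)'), 0.10),  -- 10% of the monthly charge
        (numrange(95.0, 99.0,  '[)'), 0.25)   -- 25% of the monthly charge
),
scenario(label, downtime_minutes) AS (
    VALUES
        ('maintenance excluded', 425),
        ('maintenance counted',  425 + 14)
),
month_uptime AS (
    SELECT label,
           downtime_minutes,
           -- 43200.0 = minutes in an assumed 30-day month
           round(100.0 * (1 - downtime_minutes / 43200.0), 4) AS uptime_pct
    FROM scenario
)
SELECT m.label,
       m.downtime_minutes,
       m.uptime_pct,
       t.credit_pct,
       24000 * t.credit_pct AS credit_usd   -- assumed $24,000 monthly charge
FROM month_uptime m
JOIN tier t ON t.uptime_band @> m.uptime_pct
ORDER BY m.downtime_minutes;
```

Under these assumptions, 425 minutes of downtime gives 99.0162% uptime and a 10% credit of $2,400, while 439 minutes gives 98.9838% and a 25% credit of $6,000. The 1% downtime budget for a 30-day month is 432 minutes, so the 14-minute window is exactly what crosses it.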
The right index for the job: GiST on ranges
Overlap queries (“which impacts fall in this month?”, “which maintenance windows touch this incident?”) are the access pattern this whole system runs on. So the range columns (incident windows, impact windows, maintenance windows, classification valid-time, contract effective-ranges) get GiST indexes, the index type built for range/geometric overlap. The win: an overlap query can resolve through a GiST index scan rather than a sequential scan over the whole table.
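As a minimal sketch of that setup, the statements below create a GiST index on one range column and run an overlap query against it. The table, column, and index names are hypothetical and stand in for whatever Attest actually uses.

```sql
-- Hypothetical table; names chosen to mirror the prose, not Attest's schema.
CREATE TABLE maintenance_window (
    id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    contract_id  bigint    NOT NULL,
    maint_window tstzrange NOT NULL
);

-- GiST supports the range overlap operator (&&) on this column.
CREATE INDEX maintenance_window_maint_window_gist
    ON maintenance_window USING gist (maint_window);

-- Which maintenance windows touch March 2024?
EXPLAIN
SELECT id, maint_window
FROM maintenance_window
WHERE maint_window && tstzrange('2024-03-01 00:00+00', '2024-04-01 00:00+00');
```

Whether the plan shows an index scan depends on table size and statistics: on a nearly empty table the planner may still prefer a sequential scan, so it is worth checking EXPLAIN against realistic data volumes.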