Meet Alice. Alice is impatient

What do you mean?

Meet Alice. Alice uses your web service. Alice, like most humans, measures her time in seconds and minutes. Alice says your service is slow. You tell Alice that the mean request to your service completes in 100ms, but Alice says that her mean wait time is 1s. You’re both right.

Meet Alex. Alex uses your web service. Alex, like most humans, measures his time in seconds and minutes. Alex says that when you have outages, they last a long time and he gets really annoyed. You tell Alex that your MTTR is less than 1 minute. Alex says that he sees the mean outage lasting 1 hour. Again, you’re both right.

What’s going on? You’re measuring time in requests, or in outages, while Alex and Alice are measuring time in seconds and minutes. When you have a long request or a long outage, Alex and Alice count it as a long time, weighted by how long it lasted. But you only count it as one event.
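To make the two ways of counting concrete, here is a minimal sketch. The function names and the sample durations are hypothetical, chosen only for illustration: one mean counts every request once, the other weights each request by how long it lasted.

```python
def event_weighted_mean(durations):
    """Mean as the service sees it: every request or outage counts once."""
    return sum(durations) / len(durations)


def time_weighted_mean(durations):
    """Mean as a waiting user sees it: each event is weighted by its duration."""
    return sum(d * d for d in durations) / sum(durations)


# Hypothetical sample: 99 fast requests of 50 ms and one 5.05 s straggler.
durations_s = [0.05] * 99 + [5.05]

print(f"per-request mean: {event_weighted_mean(durations_s):.3f} s")
print(f"per-second mean:  {time_weighted_mean(durations_s):.3f} s")
```

With this made-up sample the per-request mean is 100 ms, while the time-weighted mean is about 2.6 s, because more than half of all the waiting time is spent on the single slow request.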

More technically, what’s going on here is the inspection paradox. Alex and Alice don’t experience your latency distribution $f(t)$, they experience a $t$-weighted version of it. If you have an MTTR or mean request time of $\mathbb{E}[X]$, Alex and Alice experience $\mathbb{E}_a[X] = \frac{\mathbb{E}[X^2]}{\mathbb{E}[X]} = \mathbb{E}[X] + \frac{\mathrm{Var}(X)}{\mathbb{E}[X]}$. Most of the time they’re waiting, they’re waiting for things that take a long time. This is (roughly) how humans experience time.
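The step from the $t$-weighted distribution to that formula can be sketched as follows. The symbol $f_a(t)$ for the density Alex and Alice sample from is notation assumed here, not taken from elsewhere in the post.

$$
f_a(t) = \frac{t\,f(t)}{\int_0^\infty s\,f(s)\,ds} = \frac{t\,f(t)}{\mathbb{E}[X]},
\qquad
\mathbb{E}_a[X] = \int_0^\infty t\,f_a(t)\,dt = \frac{\mathbb{E}[X^2]}{\mathbb{E}[X]} = \frac{\mathrm{Var}(X) + \mathbb{E}[X]^2}{\mathbb{E}[X]} = \mathbb{E}[X] + \frac{\mathrm{Var}(X)}{\mathbb{E}[X]}
$$

For Alice’s numbers, a mean of 100ms and an experienced mean of 1s require $\mathrm{Var}(X)/\mathbb{E}[X] = 0.9\,\mathrm{s}$, that is, a standard deviation of $\sqrt{0.9 \times 0.1}\,\mathrm{s} = 0.3\,\mathrm{s}$, three times the mean.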

Let’s play with this with a little simulation. Plug in your median latency (or recovery time) and your 99th percentile latency (or recovery time). We’ll fit a log-normal distribution to those two numbers, and then plot both what your service metrics see and what your customers see.

For example, put in 30 as the median (let’s ignore the milliseconds and pretend these are minutes for now) for a 30 minute median TTR (i.e. in half of your postmortems you see a recovery time of $\leq 30$ minutes), and 600 as the p99 (one in every 100 events, recovery takes 10 hours or more). Your MTTR is just over an hour. Your customers experience a mean time to recovery of around 6 hours!
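The numbers in this example can be reproduced without any plotting. The following is a minimal sketch, not the code behind the simulation: the function names are hypothetical, and it assumes the fit simply matches the log-normal’s median and 99th percentile to the two inputs.

```python
import math
from statistics import NormalDist


def fit_lognormal(median, p99):
    """Return (mu, sigma) of a log-normal with the given median and 99th percentile."""
    z99 = NormalDist().inv_cdf(0.99)  # standard normal 99th percentile, about 2.326
    mu = math.log(median)             # the median of a log-normal is exp(mu)
    sigma = (math.log(p99) - mu) / z99
    return mu, sigma


def service_mean(mu, sigma):
    """E[X]: the mean your metrics report (each event counted once)."""
    return math.exp(mu + sigma**2 / 2)


def customer_mean(mu, sigma):
    """E[X^2] / E[X]: the time-weighted mean your customers experience."""
    return math.exp(mu + 1.5 * sigma**2)


mu, sigma = fit_lognormal(median=30, p99=600)  # minutes
print(f"MTTR seen by the service: {service_mean(mu, sigma):.0f} min")
print(f"MTTR seen by customers:   {customer_mean(mu, sigma):.0f} min")
```

Under these assumptions the service-side mean comes out at roughly 69 minutes and the customer-side mean at roughly 361 minutes, matching the "just over an hour" and "around 6 hours" figures.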

There are many arguments for why tail latency (and long recovery times) are so important to understand (e.g. multiple samples), but this is the one that I think is the least widely understood. For service times, timeout-and-retry can hide this latency some of the time (as long as the running request doesn’t hold locks or other exclusive resources). But for recovery time, no such hiding is possible. The heaviness of the tail matters a great deal.

This is also one of the reasons I don’t like trimmed measurements (like trimmed means) as a way of thinking about service latency or recovery time. They throw out some really critical context about the shape of the right tail that dominates the customer experience (the other reason is related to Little’s Law and capacity usage, which I’ve written about before).
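A small sketch of how trimming hides the tail, using made-up recovery times and hypothetical helper names. The trimmed mean discards exactly the event that accounts for most of what customers sit through.

```python
def trimmed_mean(durations, trim_count):
    """Mean after dropping trim_count samples from each end of the sorted data."""
    ordered = sorted(durations)
    kept = ordered[trim_count:len(ordered) - trim_count]
    return sum(kept) / len(kept)


def time_weighted_mean(durations):
    """Mean with each event weighted by its own duration."""
    return sum(d * d for d in durations) / sum(durations)


# Hypothetical recovery times in minutes: 99 outages of 30 min, one of 10 hours.
recovery_min = [30] * 99 + [600]

print("trimmed mean (1 from each end):", trimmed_mean(recovery_min, 1))
print("plain mean:                    ", sum(recovery_min) / len(recovery_min))
print("time-weighted mean:            ", round(time_weighted_mean(recovery_min), 1))
```

On this sample the trimmed mean is 30 minutes and the plain mean is 35.7 minutes, while the time-weighted mean is about 126 minutes.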

A note on log-normal: I chose log-normal here for numerical convenience. It has the nice property that, under this $t$-weighting, $\mathrm{lognormal}(\mu, \sigma^2)$ becomes $\mathrm{lognormal}(\mu + \sigma^2, \sigma^2)$. Also, it’s well-behaved around 0. I don’t believe that log-normal is a particularly good choice of distribution for latency or recovery time metrics, and generally would approach these problems entirely non-parametrically.
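The log-normal property can be checked numerically. This sketch, with hypothetical function names and illustrative parameters, compares the time-weighted density $t f(t) / \mathbb{E}[X]$ against the density of $\mathrm{lognormal}(\mu + \sigma^2, \sigma^2)$ at a few points.

```python
import math


def lognormal_pdf(t, mu, sigma):
    """Density of lognormal(mu, sigma^2) at t > 0."""
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2 * math.pi))


def time_weighted_pdf(t, mu, sigma):
    """t * f(t) / E[X]: the density a waiting customer samples from."""
    mean = math.exp(mu + sigma**2 / 2)
    return t * lognormal_pdf(t, mu, sigma) / mean


mu, sigma = math.log(30), 1.29  # illustrative parameters, not fitted to real data
for t in (1, 30, 600, 5000):
    weighted = time_weighted_pdf(t, mu, sigma)
    shifted = lognormal_pdf(t, mu + sigma**2, sigma)
    assert math.isclose(weighted, shifted, rel_tol=1e-9)
```

The two densities agree at every point tested, which is why the customer-side mean in a log-normal model is just the mean of the shifted distribution, $e^{\mu + 3\sigma^2/2}$.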