That one time I used Go panics for flow control
How our protagonist discovered that a key service that powers our support was absurdly vulnerable to overload, and what we did to fix it.
Part of our support infrastructure at work is an in-memory datastore that allows us to query our outstanding support work over various dimensions, such as work type, whether it’s been put on hold for some reason, etc. It’s functionally equivalent to a single table in an SQL database, where you have a single dataset, boolean filters and configurable sorting.
It’s kind of analogous to having bitmap filters with post-hoc filtering, so any use of sort/limit will sort the entire result set. And the key part here is that the result sets can be large enough that sorts can take one or two seconds.
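To make the cost concrete, here is a minimal sketch of a filter-then-sort-then-limit query. The `Row` type, its fields and the `topN` helper are hypothetical stand-ins (the real schema isn’t shown here); the point is only that the sort runs over every matching row before the limit is applied.

```go
package main

import (
	"cmp"
	"slices"
)

// Row is a hypothetical record type; the real schema is not shown in this post.
type Row struct {
	ID       int
	Priority int
	OnHold   bool
}

// topN filters first, then sorts everything that matched, and only then
// applies the limit. The sort cost depends on how many rows matched,
// not on how small the limit is.
func topN(table []Row, keep func(Row) bool, limit int) []Row {
	var matched []Row
	for _, r := range table {
		if keep(r) {
			matched = append(matched, r)
		}
	}
	// Highest priority first.
	slices.SortFunc(matched, func(a, b Row) int {
		return cmp.Compare(b.Priority, a.Priority)
	})
	if len(matched) > limit {
		matched = matched[:limit]
	}
	return matched
}
```

Asking for the top two rows out of a million matches still pays for sorting the full million.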
And for a bit of context, this service deployment wasn’t autoscaled at the time, and upstream services would retry failed requests. Sometimes after a relatively short timeout. Which is fun.
So, one day, this service had more query load than it could handle, and because of the inelasticity, it got overloaded, and queries started to take way longer (like, up to a minute vs. a typical 1-2s).
Unfortunately, because this was an incident, and sometimes the panic sets in, one of my theories was that memory had gotten slower. Which of course was absurd, but under time pressure, incident brain can be very real.
However, as foreshadowed earlier, this service had simply become overloaded, so we not only had slightly higher than average demand, but also failure demand from retries.
Most of the time in a Go service, we pass around a context, so that when the caller gives up on us, we can cancel the operation, short-circuit and bail early. However, when we were able to get a CPU profile and take a look, the vast majority of the CPU time was taken up in the sort phase of the query.
In Go, none of the sort functions support cancellation (reasonably so, as normally you’re either in a batch context, or sorting small enough counts that the time taken isn’t significant). So, what to do?
Normally, context cancellation has leaf functions check for an error, and then propagate it via the typical errors-as-values mechanism. However, none of the sort functions (e.g. slices.SortFunc) take a context, or allow returning an error.
Thankfully, Go has another, non-local signalling mechanism for handling errors (e.g. if you’ve dereferenced a nil pointer), in the form of panics. This tends not to be used much for error handling per se, because the non-local flow control can be harder to reason about, but it can make sense within a single narrowly defined context.
For example, the encoding/json package does this: it panics via json.(*encodeState).error(…) and recovers within the scope of the top-level json.(*encodeState).marshal(…) function. So no client code actually sees the non-local control flow, and no engineers experience unexpected panics.
So we changed the code from something like this:
func execute(ctx context.Context) (results, error) {
	resultSet := query.filter(someTable)
	// Sorts the whole result set; nothing in here ever looks at ctx.
	slices.SortFunc(resultSet, func(a, b Row) int {
		return query.compare(a, b)
	})
	return resultSet, nil
}
To something like this:
// nonLocalCancellation is a private panic payload, so that recover can tell
// our deliberate cancellation apart from any other panic.
type nonLocalCancellation struct{ err error }

func execute(ctx context.Context) (res results, err error) {
	resultSet := query.filter(someTable)
	defer func() {
		// Ref: https://go.dev/blog/defer-panic-and-recover
		if r := recover(); r != nil {
			c, ok := r.(nonLocalCancellation)
			if !ok {
				// Not ours: re-panic so genuine bugs still surface.
				panic(r)
			}
			// After recover, execute returns straight to its caller, so the
			// error has to travel out through the named results.
			res, err = nil, c.err
		}
	}()
	slices.SortFunc(resultSet, func(a, b Row) int {
		if err := ctx.Err(); err != nil {
			panic(nonLocalCancellation{err: err})
		}
		return query.compare(a, b)
	})
	return resultSet, nil
}
Which is a lot of messing about (it’s an ugly solution to an ugly problem), but does mean if the caller gives up on the query, we don’t waste time sorting a result for someone who will never care about it.