When Impressive Performance Gains Do Not Matter

Of everything I have worked on in my career, performance work has been the most rewarding. I enjoy making systems more efficient, especially when it opens up brand-new possibilities for customers. I also find that developing an empirical understanding of systems is one of the best ways to learn how they work from first principles, especially how complex systems interact at scale or under load. But one of the greatest benefits of performance work is the creativity that comes from working intimately with systems. Through performance work, I find people develop a wealth of ideas for how products and services can be improved, most of which are not even related to performance optimization. While improving performance always feels good, impressive claims like “10 times faster,” “an order of magnitude more efficient,” or “fifty percent fewer resources” may not have the impact you anticipate because of constraints that are not always obvious or intuitive. This is an essay about three of those constraints.

Attention Threshold

Recently, I worked on improving the query performance of a new database that returns data to a user interface for graphing and interactive analysis. We were developing the new database with the goal of improving response time by an order of magnitude over the existing database, which had been in use for many years. The most expensive queries against the old database took between 5 and 10 minutes. After months of difficult engineering, we got the same queries to complete in 30 seconds to 1 minute, an order-of-magnitude improvement. A presentation to management highlighting these gains would look very impressive: queries that used to take 10 minutes now return in 1 minute. However, I insisted that the work would not have the impact we wanted unless we squeezed out an additional order of magnitude.

Human-factors research identifies 10 seconds as the limit for keeping someone’s attention. For delays longer than this, people will perform other tasks while they wait. Therefore, even though a query that used to take 5 minutes now took 30 seconds, both were well above the 10-second attention threshold. In both cases, people will context-switch: check their messages, go for coffee, start a conversation, start another task. When they finally return their attention a few minutes or hours later, the user interface will have loaded, but the time it actually took is immaterial. Ultimately, if we could not complete queries in under 10 seconds, our performance improvements would not change the way people work. In complex systems, improving performance by an order of magnitude is often an incredibly difficult feat. Sadly, we needed another order-of-magnitude improvement: queries had to complete in under 10 seconds to hold users’ attention.
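
The threshold argument can be sketched as a small check. This is only an illustrative model: the 10-second limit and the query times come from the text, while the function and constant names are hypothetical. It treats a speedup as changing user behavior only if it moves a query from above the limit to below it.

```python
# Illustrative model only: the 10-second limit comes from the text;
# the constant and function names are hypothetical.
ATTENTION_LIMIT_S = 10.0


def holds_attention(latency_s: float, limit_s: float = ATTENTION_LIMIT_S) -> bool:
    """True if the result arrives before the user is likely to switch tasks."""
    return latency_s < limit_s


def changes_behavior(before_s: float, after_s: float,
                     limit_s: float = ATTENTION_LIMIT_S) -> bool:
    """True only if the speedup moves the query from above the limit to below it."""
    return not holds_attention(before_s, limit_s) and holds_attention(after_s, limit_s)


# Query times from the text, in seconds.
print(changes_behavior(600, 60))  # False: 10x faster, still above the limit
print(changes_behavior(300, 30))  # False: 10x faster, still above the limit
print(changes_behavior(60, 6))    # True: a second 10x crosses the limit
```

Under this model the size of the speedup never appears in the result; only the side of the threshold each latency falls on does.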

Going From One to Two

Years ago, I worked on a project where we made incredible gains in efficiency by automating manual tasks, removing unnecessary steps, parallelizing parts of the process, and deferring steps that could be completed later, asynchronously. It improved the overall process from a few hours to reliably under an hour, somewhere between a 25 and 50 percent improvement. We were understandably excited about this impact. As it turned out, this improvement in software performance didn’t change the outcome of the overall process, because that process was constrained by logistics. To demonstrate, consider a plumber, an electrician, or a carpenter. They each need to schedule work at a location, travel to that location, and then complete the work. For the sake of argument, if they work 8 hours in a day, and it takes 8 hours to complete the work at a location, then it doesn’t really matter if a process improvement saves 2 or 3 hours, because there still isn’t enough time to travel to a new location and complete a new job. If you can’t get each job below 4 hours, including travel time, then you can’t complete two in a day. Breaching thresholds like this can be incredibly difficult, and the efficiency gains along the way don’t pay off until you do. Going from one to two can be incredibly hard.
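
The arithmetic behind the tradesperson example can be sketched in a few lines. This is a simplified model with a hypothetical function name, assuming every job takes the same time (travel included) and that only whole jobs count.

```python
import math


def jobs_per_day(job_hours: float, workday_hours: float = 8.0) -> int:
    """Whole jobs that fit in one workday; job_hours includes travel time."""
    if job_hours <= 0:
        raise ValueError("job_hours must be positive")
    return math.floor(workday_hours / job_hours)


# Shaving hours off an 8-hour job changes nothing until two jobs fit in a day.
for hours in (8.0, 6.0, 5.0, 3.9):
    print(f"{hours} h per job -> {jobs_per_day(hours)} job(s) per day")
```

In this model, cutting a job from 8 hours to 5 hours is a 37.5 percent improvement that still yields one job per day; the output only changes once the per-job time drops far enough for a second job to fit.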

Backpressure in Pipelines

The software infrastructure for many businesses includes data pipelines where events are produced from many different sources (vehicles, factory equipment, mobile phones, financial transactions) and then processed reliably to drive many other services and applications. The events are usually persisted to a durable log from which downstream services consume and process them. To achieve high throughput at scale, the log must be partitioned, and the downstream services use techniques like batching, pipelining, parallelism, efficient memory allocation, dynamic scaling, and more. Performance bottlenecks in data pipelines can be hard to find because the system dynamics are correlated. A slow stage in the pipeline will, by design, apply backpressure to the upstream stages. If there are multiple bottlenecks in the pipeline, which is common with these systems, the overall throughput will not improve until every last bottleneck is removed. It is good engineering practice to break pipelines into stages and understand the performance dynamics and limitations of each stage. But many times I have seen engineers disappointed when they improve a single stage by many orders of magnitude only to see it have no effect on the overall throughput. If you are going to make throughput improvements to pipelines, the number that matters is the end-to-end throughput.
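
A minimal sketch of why a single-stage win can disappear: assuming a serial pipeline in steady state where backpressure holds every stage to the pace of the slowest one, end-to-end throughput is the minimum of the stage capacities. The stage names and numbers below are made up for illustration.

```python
def end_to_end_throughput(stage_capacity: dict[str, float]) -> float:
    """Steady-state throughput of a serial pipeline: the slowest stage's capacity."""
    return min(stage_capacity.values())


def bottlenecks(stage_capacity: dict[str, float]) -> list[str]:
    """Names of every stage currently limiting the pipeline."""
    limit = end_to_end_throughput(stage_capacity)
    return [name for name, cap in stage_capacity.items() if cap == limit]


# Hypothetical capacities in events per second; two stages are equally slow.
stages = {"ingest": 50_000, "parse": 2_000, "enrich": 20_000, "write": 2_000}
print(end_to_end_throughput(stages), bottlenecks(stages))  # 2000 ['parse', 'write']

stages["parse"] *= 100  # two orders of magnitude on one stage
print(end_to_end_throughput(stages), bottlenecks(stages))  # 2000 ['write']

stages["write"] *= 100  # remove the second bottleneck as well
print(end_to_end_throughput(stages), bottlenecks(stages))  # 20000 ['enrich']
```

The model ignores batching effects, queue depth, and variance between stages, so it is a way to reason about where the limit sits rather than a substitute for measuring end-to-end throughput.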

Conclusion

Performance work can be incredibly challenging, but it is also a discipline for intimately understanding complex systems and engineering better products. Just be sure that incredible gains in performance actually produce the desired outcomes. If you need to hold people’s attention, you only have about 10 seconds. If whole increments are a constraint, percentage gains are not enough; you need to be able to go from one to two. To maximize throughput in pipelines that backpressure, often…
