How I Ensure My Application Scales

During a job interview, I was explaining my day-to-day responsibilities and how I ensure quality on my projects when I mentioned, “I check that my application scales.” The interviewer asked, “How do you make sure your application scales?” I froze. I didn’t have a structured answer, and I got stuck because I wasn’t prepared for that question. This post is a retrospective on that experience, and it outlines a framework for thinking about scalability when working on new features.

Why is this question difficult to answer? Scalability has many dimensions: traffic, throughput, data size, latency, and more. One critical flow may have only 10 concurrent users who work with terabytes of data, while another flow handles thousands to millions of requests per second. Each use case has to optimize for different things. With that in mind, let’s explore good practices that help us build scalable systems.

1. Define the problem and the scope.

Before talking about QPS, latency, or costs, we need to fully understand the scope of the problem. The scope tells us what is important to implement, and it also guides how we design the solution, because there is rarely a single right answer. During this process we decide what matters most and establish objectives. Defining what success looks like largely means defining SLOs, SLAs, and KPIs, which gives us clarity on what to optimize for.

2. Identify bottlenecks.

Once we understand what matters most, we start making estimates. Estimates help us understand the impact of the new feature and verify that our systems can handle it. Example scenarios: Will the downstream service be able to absorb an additional 10,000 QPS? A new Spark job will create thousands of records per second; can the datastore sustain the expected throughput? As the data size grows, does the cost of fetching a record grow with it? My feature will use an LLM; how can I optimize token usage to maximize ROI?
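
The first scenario above can be turned into a quick back-of-the-envelope check. The sketch below is only an illustration: the class name, the field names, the 20% safety margin, and every number except the additional 10,000 QPS are assumptions I made up for the example, not measurements from a real service.

```python
from dataclasses import dataclass


@dataclass
class CapacityCheck:
    """Hypothetical inputs for a rough headroom estimate."""

    current_peak_qps: float     # observed peak load on the downstream service
    added_qps: float            # extra load the new feature is expected to add
    max_sustainable_qps: float  # load the service can sustain, e.g. from a load test
    safety_margin: float = 0.2  # fraction of capacity kept free for spikes (assumption)

    def projected_qps(self) -> float:
        # Load we expect once the feature is fully rolled out.
        return self.current_peak_qps + self.added_qps

    def usable_qps(self) -> float:
        # Capacity we allow ourselves to use after reserving the safety margin.
        return self.max_sustainable_qps * (1.0 - self.safety_margin)

    def fits(self) -> bool:
        return self.projected_qps() <= self.usable_qps()


# Made-up numbers: 25,000 QPS today, 10,000 QPS added, 40,000 QPS sustainable.
check = CapacityCheck(current_peak_qps=25_000, added_qps=10_000, max_sustainable_qps=40_000)
print("projected:", check.projected_qps())
print("usable:", check.usable_qps())
print("fits:", check.fits())
```

With these made-up numbers the projected load exceeds the usable capacity, so the estimate flags the downstream service as a bottleneck before any code is written.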

3. Beware of premature optimization.

I know we sometimes get excited about the next unicorn idea and believe in the great potential of what we are building. That optimism is fine, but when building things I strongly suggest that you first optimize for yourself or a small number of users, test your idea, and get data. This helps us validate assumptions, understand growth patterns, and invest in scalability only when the data justifies it.

4. Analyze complexity.

When talking about Big O notation, it is hard not to think about LeetCode or software engineering interviews, but one reason Big O matters is scalability. Let me explain with an example. Imagine you have a SQL database with a table called appointments. The table has a primary key, start and end datetimes, and other relevant information about each appointment. You would like to fetch all the appointments for next week. What would the time complexity look like?

The appointments table doesn’t have an index on the start datetime: This search requires a full-table scan, so its time complexity is O(N), where N is the number of rows in the appointments table. At the beginning this might not be an issue, but as the data grows, the database has to scan every appointment to evaluate the filter condition, and I/O and memory usage grow with it.
The table has a B+Tree index on the start datetime: This reduces the time complexity to O(log N + K), where N is the number of rows in the table and K is the number of rows returned. This is usually acceptable performance, and it scales much better than having no index.
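
The two cases above can be observed by asking a database for its query plan. The sketch below uses an in-memory SQLite database as a stand-in for whatever SQL engine you run in production. The column names (start_at, end_at), the index name, and the date range are assumptions for illustration. SQLite implements indexes as B-trees, and the exact wording of its plan output varies between versions, so treat the printed text as indicative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE appointments ("
    " id INTEGER PRIMARY KEY,"
    " start_at TEXT NOT NULL,"
    " end_at TEXT NOT NULL)"
)

# Range filter on the start datetime, i.e. "all appointments for next week".
QUERY = "SELECT id, start_at, end_at FROM appointments WHERE start_at >= ? AND start_at < ?"
NEXT_WEEK = ("2025-01-06T00:00:00", "2025-01-13T00:00:00")  # hypothetical range


def query_plan(connection: sqlite3.Connection) -> str:
    """Return SQLite's human-readable plan for QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, NEXT_WEEK).fetchall()
    # The last column of each row holds the plan description.
    return " ".join(row[3] for row in rows)


# Case 1: no index on start_at, so the planner falls back to a full scan: O(N).
plan_without_index = query_plan(conn)

# Case 2: with an index, the planner can seek to the start of the range
# and read only the matching rows: O(log N + K).
conn.execute("CREATE INDEX idx_appointments_start_at ON appointments (start_at)")
plan_with_index = query_plan(conn)

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The first plan should describe a scan of the whole table, and the second should describe a search that uses the new index. Other engines expose the same information through their own EXPLAIN commands.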

5. Think about trade-offs.

Consider an event-driven architecture. Using events can help us optimize user-facing latency by moving expensive work (for example, a request to an LLM) out of the synchronous path, but it comes with added complexity: higher end-to-end latency, network issues, lag, dropped events, and so on. So I would think critically about when it is the right time to invest in an event-driven architecture and take on all the trade-offs that come with it, making sure the improvement in experience is worth the effort of maintaining the platform. Remember point #3 (Beware of premature optimization).
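
To make the trade-off concrete, here is a minimal single-process sketch of moving expensive work out of the synchronous path. It uses an in-memory queue and a background thread purely for illustration. A real system would more likely use a message broker, and the function names (handle_request, slow_summarize) are hypothetical.

```python
import queue
import threading

events = queue.Queue()  # stand-in for a message broker (assumption)
results = {}            # stand-in for wherever finished work is stored


def slow_summarize(text):
    """Placeholder for an expensive call, such as a request to an LLM."""
    return text.upper()


def handle_request(request_id, text):
    """Synchronous path: enqueue the work and answer the user immediately."""
    events.put({"request_id": request_id, "text": text})
    return {"request_id": request_id, "status": "accepted"}


def worker():
    """Asynchronous path: consume events and do the expensive work."""
    while True:
        event = events.get()
        try:
            if event is None:  # sentinel value that tells the worker to stop
                return
            results[event["request_id"]] = slow_summarize(event["text"])
        finally:
            events.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

response = handle_request("req-1", "hello")  # returns before the work is done
events.join()     # wait until the queued event has been processed
events.put(None)  # ask the worker to stop
thread.join()

print(response)
print(results)
```

Even in this toy version the costs are visible: the caller only learns that the request was accepted, the result arrives later through a separate channel, and an event that is lost between the put and the worker would never produce a result.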

6. Measure, Validate, and Iterate.

We understood the problem, defined what matters most, and implemented our solution, but this is not “fire and forget.” We need to set up metrics, alerts, and dashboards. These help us monitor and validate whether we are meeting our SLOs through incremental rollouts of the new feature, then compare and act when necessary. Once everything is set up, scalability becomes an ongoing process of measuring, learning, and adapting. Production data puts us in a better position to do capacity planning and to understand organic growth, costs, and ROI, because we now have a real view of how the service behaves.
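
As a small illustration of validating an SLO during an incremental rollout, the sketch below checks latency samples against a percentile threshold. The sample values, the 200 ms threshold, and the function names are assumptions for the example. In practice these numbers would come from your metrics system, not from a hard-coded list.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_slo(samples_ms, pct, threshold_ms):
    """True when the pct-th percentile latency is within the threshold."""
    return percentile(samples_ms, pct) <= threshold_ms


# Hypothetical latencies (in ms) collected from a canary that serves the new feature.
canary_latencies_ms = [120, 95, 110, 480, 105, 98, 130, 101, 99, 115]

print("p50:", percentile(canary_latencies_ms, 50))
print("p90 within 200 ms:", meets_latency_slo(canary_latencies_ms, 90, 200))
print("p99 within 200 ms:", meets_latency_slo(canary_latencies_ms, 99, 200))
```

In this made-up data the median and the 90th percentile look healthy, while a single slow request pushes the 99th percentile over the threshold. This is the kind of signal that should pause a rollout and trigger an investigation.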

Conclusion

Scaling requires critical thinking about what we are building, understanding the dimensions involved, and evaluating the ROI of the proposed architecture. Remember that there is not always a single right answer when designing the architecture of our projects, but we need clarity about what we are building, and we need to ensure that the benefits outweigh the operational and engineering costs. Scalability is not a feature you add at the end. It is a continuous process of understanding constraints, making trade-offs, and validating assumptions.