System Design Series: Apache Flink from 10,000 Feet, and Building a Flink-powered Recommendation Engine

System Design Series: Apache Flink from 10,000 Feet, and Building a Flink-powered Recommendation Engine

系统设计系列:从宏观视角看 Apache Flink,以及构建基于 Flink 的推荐引擎

A deep dive into how Apache Flink works, why it exists, and learning it while building a real-time recommendation engine. 深入探讨 Apache Flink 的工作原理、存在意义,并在构建实时推荐引擎的过程中学习它。

For a while now, I’ve had Apache Flink on my “things I really need to understand properly” list. I’d seen it mentioned alongside Kafka, heard it come up in conversations about real-time pipelines, and sort-of understood the use case. But I’d never actually sat down and learned it properly. If you feel the same way, you’re in good company. 有一段时间,我一直把 Apache Flink 列在“我真正需要深入理解”的清单上。我曾在提到 Kafka 时看到过它,在关于实时流水线的讨论中听到过它,也大致了解它的应用场景。但我从未真正静下心来好好学习。如果你也有同感,那么你并不孤单。

There’s good reason to learn about Flink, it’s one of the most popular tools in software engineering right now. Netflix uses it for near-real-time anomaly detection in their streaming infrastructure. Alibaba reportedly runs one of the largest Flink deployments in the world — processing hundreds of billions of events per day across tens of thousands of machines. Uber built their analytical platform around it. Flink has become the backbone of how some of the most data-intensive companies in the world process information as it happens. 学习 Flink 是有充分理由的,它是目前软件工程中最流行的工具之一。Netflix 使用它在其流式基础设施中进行近实时的异常检测。据报道,阿里巴巴运行着全球最大的 Flink 部署之一,每天在数万台机器上处理数千亿条事件。Uber 围绕它构建了自己的分析平台。Flink 已成为全球一些数据密集型公司处理实时信息的支柱。

So if Flink has been on your list too, this is a good time to actually understand it. So I dove in. And I was honestly surprised, not just by what Flink is, but by why it exists and how it’s built. The story of Flink is really the story of a much deeper idea: the idea of how to understand high-scale, constantly streaming data. The problem statement is actually pretty simple: how do you build real-world and practical answers from massive scale of continuous data. This post is my attempt to explain that idea from the ground up, and show you where Flink fits into it. Let’s dive in. 所以,如果 Flink 也在你的清单上,现在正是深入了解它的好时机。我深入研究了它,坦率地说,我不仅对 Flink 本身感到惊讶,更对其存在的原因和构建方式感到震撼。Flink 的故事实际上是一个更深层理念的故事:即如何理解大规模、持续流式传输的数据。其核心问题其实很简单:如何从海量的连续数据中构建出真实且实用的答案。这篇文章是我从零开始解释这一理念的尝试,并向你展示 Flink 在其中的位置。让我们开始吧。

Before We Start

在开始之前

Two concepts come up constantly in this post that are worth making sure we’re on the same page about before we go further. 本文中会不断出现两个概念,在深入探讨之前,我们需要确保对它们有共识。

What is a stream? A stream is a continuous, potentially never-ending sequence of records arriving over time. Think about a user browsing a website — every page view, every click, every scroll is an event being produced. One after another, in real time. There’s no natural “end” to this — as long as the user is active, events keep coming. That’s a stream. 什么是流? 流是一个连续的、可能永无止境的记录序列,随时间推移不断到达。想象一下用户浏览网站的过程——每一次页面浏览、每一次点击、每一次滚动都是一个产生的事件。它们一个接一个地实时发生。这没有自然的“终点”——只要用户处于活跃状态,事件就会源源不断地产生。这就是流。

What is batch processing? Batch processing means taking a finite, bounded collection of data and processing it all at once. Instead of reacting to each event as it arrives, you collect events for a period of time — say, an hour — and then run a computation over all of them together. The computation has a clear start and a clear end. 什么是批处理? 批处理意味着获取一个有限的、有界的数据集合并一次性处理。你不是在每个事件到达时就做出反应,而是收集一段时间(比如一小时)内的事件,然后对它们进行统一计算。这种计算有明确的开始和结束。

Both are legitimate ways to process data. The tension between them is what Flink was built to resolve — and we’ll get there. 这两种都是处理数据的合法方式。它们之间的矛盾正是 Flink 所要解决的问题——我们稍后会讲到。

Back To The Problem: How We Actually Produce Data

回到问题:我们实际上是如何产生数据的

Let me make this concrete with an example we’ll use throughout this post. Imagine you’re building a recommendation engine — the kind that shows users “you might also like these” based on what they’ve been viewing. To do this well, your system needs to know things like: What has this user been clicking on in the last few minutes? What items are trending right now across all users? Which products did this user view but not purchase in the last session? 让我用一个贯穿全文的例子来具体说明。想象一下你正在构建一个推荐引擎——那种根据用户浏览记录向其展示“你可能也喜欢这些”的系统。为了做好这一点,你的系统需要知道:该用户在过去几分钟内点击了什么?目前所有用户中哪些商品是热门的?该用户在上次会话中浏览了哪些产品但未购买?

Now, where does that data come from? Every time a user opens a product page, you record an event. Every click, every purchase, every search — your application is continuously writing records that look roughly like this: 那么,这些数据从哪里来?每当用户打开产品页面,你就会记录一个事件。每一次点击、每一次购买、每一次搜索——你的应用程序都在持续写入大致如下的记录:

{ "user_id": "u-8821", "item_id": "p-443", "event_type": "view", "timestamp": "2024–03–10T14:32:01Z" }
{ "user_id": "u-1042", "item_id": "p-117", "event_type": "purchase", "timestamp": "2024–03–10T14:32:03Z" }
{ "user_id": "u-8821", "item_id": "p-501", "event_type": "click", "timestamp": "2024–03–10T14:32:07Z" }

One record every few seconds for every user, across millions of concurrent users, continuously. That’s your data. Not a file. Not a table that refreshes once a day. A stream — an ongoing, never-ending sequence of events that describes what your users are doing right now. 对于数百万并发用户中的每一位,每隔几秒就会产生一条记录,持续不断。这就是你的数据。它不是一个文件,也不是一个每天刷新一次的表。它是一个流——一个描述用户当前行为的、持续不断的事件序列。

And yet the dominant paradigm for years was to take that stream and… ignore the fact that it was a stream. Dump the events into files every hour. Wait for the batch job to run. Then serve recommendations based on what users were doing last hour. 然而,多年来的主流范式却是获取这个流,然后……忽略它是一个流的事实。每小时将事件转储到文件中,等待批处理作业运行,然后根据用户上一小时的行为提供推荐。

Why? Because batch processing is conceptually simple. You know exactly what data you have. You can reason about the computation clearly — it starts, it runs, it finishes. Systems like Hadoop and MapReduce were built around this model and scaled to enormous data sizes. They worked. 为什么?因为批处理在概念上很简单。你确切地知道你拥有什么数据。你可以清晰地推断计算过程——它开始、运行、结束。像 Hadoop 和 MapReduce 这样的系统就是围绕这种模型构建的,并扩展到了巨大的数据规模。它们确实有效。

But there’s a fundamental cost: latency. If your batch job runs every hour, then at worst case, a user’s behavior right now won’t influence their recommendations for up to an hour. For a recommendation engine, that means a user who just showed strong interest in hiking gear gets shown laptop accessories — because the system hasn’t caught up yet. The user searched for a hiking rucksack, and you need to show them tents and hiking poles on the next page load, not one hour later. 但它有一个根本性的代价:延迟。如果你的批处理作业每小时运行一次,那么在最坏的情况下,用户现在的行为要过一小时才会影响他们的推荐结果。对于推荐引擎来说,这意味着一个刚刚表现出对徒步装备浓厚兴趣的用户,却被推荐了笔记本电脑配件——因为系统还没跟上。用户搜索了徒步背包,你需要在下一次页面加载时向他们展示帐篷和登山杖,而不是一小时后。

For fraud detection, hourly latency means fraudulent transactions go undetected for an hour. For a live dashboard, it means your “real-time” metrics can be up to 59 minutes stale. The cost of batch is that events happen in real time, but your system only learns about them on a schedule. 对于欺诈检测,一小时的延迟意味着欺诈交易在一小时内无法被发现。对于实时仪表盘,这意味着你的“实时”指标可能会有长达 59 分钟的滞后。批处理的代价在于:事件是实时发生的,但你的系统只能按计划获知它们。

So as data volumes grew and latency requirements tightened, engineers started building streaming systems alongside their batch systems — systems that could process each event as it arrived, in milliseconds. Apache Storm was an early leader here. Amazon Kinesis. LinkedIn’s Samza. 因此,随着数据量的增长和延迟要求的提高,工程师们开始在批处理系统之外构建流式处理系统——这些系统可以在毫秒级处理每个到达的事件。Apache Storm 是早期的领导者,还有 Amazon Kinesis 和 LinkedIn 的 Samza。

But building a new streaming system, while maintaining an existing batch system, isn’t so straightforward. Now you have two systems to maintain. Your streaming pipeline computed approximate, real-time results… 但是,在维护现有批处理系统的同时构建一个新的流式系统并不简单。现在你需要维护两套系统。你的流式流水线计算出的是近似的实时结果……