Deploying a Multistage Multimodal Recommender System on Amazon Elastic Kubernetes Service

Featuring Bloom filters, feature caching, contextual ranking, and an end-to-end pipeline from data preparation to model serving.

Mustapha Momoh, May 19, 2026, 20 min read

Figure 1: Architecture of the multistage recommender system deployed on Amazon EKS (image by author, inspired by prior work from Even Oldridge and Karl Byleen-Higley, and from Sam, Tyler, and Nathan)

Building a production multistage, multimodal recommender system is not trivial, especially when it needs to scale, adapt in near real time, and run reliably in the cloud. In this post, I walk through my experience designing and deploying such a system end-to-end, from data preparation and model training to serving the models in production. We’ll explore the full pipeline, including retrieval, filtering, scoring, and ranking, along with the infrastructure and key decisions that make it all work. This includes feature stores, Bloom filters, Kubeflow, near real-time preference adaptation, and a major latency win from in-memory feature caching. It’s a long read, but if you’re building or scaling recommender systems, you’ll find practical patterns here that you can apply directly to your own projects.

Some information about the system

The recommender system consists of four main stages: a Two-Tower model generates candidates, a Bloom filter temporarily hides items the user recently interacted with, a DLRM ranker scores the remaining items using user, item, and context features, and a final reranking stage orders and samples from these scores to produce the final recommendations.

The models combine tabular collaborative features with precomputed CLIP image embeddings and Sentence-BERT text embeddings. In the retrieval model, these pretrained embeddings are fed into the candidate tower together with learned item features, giving the candidate tower both content-based semantic signals and collaborative signals. The dot product between the query-tower output and the candidate-tower output is then used as a learned relevance score in this shared embedding space. In the DLRM ranker, the pretrained image and text embeddings participate in the dot-product interaction layer. These pairwise interactions are then passed to the top MLP, allowing content-based signals from the pretrained embeddings to complement the collaborative and contextual signals used for click prediction.
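To make the two scoring mechanisms concrete, here is a minimal sketch in plain Python. It is not the Merlin implementation: the embedding width, the random vectors, and the function names are all assumptions made for illustration, and the ranker's feature vectors are assumed to be already projected to a common width. It shows a two-tower relevance score as a dot product, and a DLRM-style interaction step that takes the dot product of every distinct pair of feature vectors.

```python
import random
from itertools import combinations

DIM = 8  # hypothetical embedding width, for illustration only


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def two_tower_scores(query_vec, candidate_vecs):
    """One relevance score per candidate: query output dotted with each candidate output."""
    return [dot(query_vec, c) for c in candidate_vecs]


def pairwise_interactions(feature_vecs):
    """DLRM-style interaction: dot product of every distinct pair of feature vectors."""
    return [dot(u, v) for u, v in combinations(feature_vecs, 2)]


rng = random.Random(0)


def random_vec():
    """Stand-in for a learned or pretrained embedding."""
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


# Retrieval: score five hypothetical items against one query-tower output.
query = random_vec()
items = [random_vec() for _ in range(5)]
scores = two_tower_scores(query, items)

# Ranking: five vectors standing in for user, item, context, image, and text
# features, assumed to share the same width so they can be dotted together.
features = {name: random_vec() for name in ("user", "item", "context", "image", "text")}
interactions = pairwise_interactions(list(features.values()))

# Five vectors give 5 * 4 / 2 = 10 pairwise terms for the top MLP to consume.
print(len(scores), len(interactions))  # prints "5 10"
```

Because the image and text vectors sit in the same interaction step as the user, item, and context vectors, the pairwise terms include cross terms such as user-image and context-text, which is how content signals can reach the click prediction.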

Why the current design was chosen

The target use case is an ecommerce platform that needs to recommend relevant products as soon as users land on the homepage. The platform serves both registered users and anonymous visitors, and user behavior can vary substantially with the request context, such as device type, time of day, or day of week. That means the recommendation service must provide reasonable cold-start recommendations for new users and must adapt recommendations to the context of the current request.

The solution also needs to scale. As more retailers are onboarded, the product catalog could grow to millions of items. At that point, scoring the full catalog on every request is impractical. A multistage design solves this problem by using a lightweight retrieval stage to fetch candidates quickly and a heavier ranking stage to score only those candidates.

The recommendation models also need to stay up to date with new interactions, but rebuilding the full retrieval stack every day is not practical. For this reason, two Kubeflow pipelines are defined. The first pipeline sets up the preprocessing workflows, trains the models from scratch, builds the ANN index, and deploys the Triton server and models. The second pipeline manages daily fine-tuning, which primarily updates the query tower and the ranker; the models are updated with new interaction signals, but the item embeddings are not regenerated.
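The retrieve-then-rank split can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the catalog is random, the sizes are arbitrary, an exact top-k search stands in for the ANN index, and `rank_score` is a toy placeholder with a made-up context boost rather than a real ranking model. The point is only that the expensive scorer runs on the retrieved candidates and not on the whole catalog.

```python
import heapq
import random

rng = random.Random(7)
CATALOG_SIZE, DIM, TOP_K = 2_000, 4, 50  # illustrative sizes only

# Hypothetical catalog: item ID -> embedding vector.
catalog = {i: [rng.gauss(0, 1) for _ in range(DIM)] for i in range(CATALOG_SIZE)}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec, k):
    """Cheap stage: exact top-k by dot product, standing in for an ANN index lookup."""
    return heapq.nlargest(k, catalog, key=lambda i: dot(query_vec, catalog[i]))


ranker_calls = 0  # counts how often the expensive stage runs


def rank_score(query_vec, item_id, context):
    """Expensive stage (toy placeholder): a real ranker would run a neural network."""
    global ranker_calls
    ranker_calls += 1
    # Made-up context effect, present only to show context entering the score.
    boost = 0.1 if context["device"] == "mobile" and item_id % 2 == 0 else 0.0
    return dot(query_vec, catalog[item_id]) + boost


def recommend(query_vec, context, n=10):
    """Retrieve TOP_K candidates, score only those, and return the best n."""
    candidates = retrieve(query_vec, TOP_K)
    scored = [(rank_score(query_vec, i, context), i) for i in candidates]
    return [i for _, i in sorted(scored, reverse=True)[:n]]


query = [rng.gauss(0, 1) for _ in range(DIM)]
recs = recommend(query, {"device": "mobile"})
print(len(recs), ranker_calls)  # prints "10 50": the ranker saw 50 items, not 2,000
```

As the catalog grows, the cost of the ranking stage stays tied to `TOP_K`, while the cost of retrieval depends on how well the index scales, which is why an approximate index is used for that stage.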

System components

The components work together to serve relevant recommendations quickly and at reasonable scale. Kubeflow Pipelines manages both the full training workflow and the daily fine-tuning workflow on the Kubernetes-based system. The NVIDIA Merlin stack handles GPU-accelerated feature engineering, preprocessing, and training of the retrieval and ranking models. Triton Inference Server hosts the multistage serving graph as a single ensemble model. FAISS serves as the approximate nearest neighbor (ANN) index for candidate retrieval. Feast manages the user and item features across training and serving. ElastiCache for Valkey (Redis) backs the online feature store, holds each user’s Bloom filter so that already-seen items can be filtered from that user’s recommendation list, and stores global and category-level item popularity based on interaction counts. Amazon Athena (with S3 and Glue) backs the offline feature store. Amazon Elastic Kubernetes Service (EKS) runs the containerized machine learning workflows and scales compute to meet changing workload demands.
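For readers unfamiliar with Bloom filters, the following is a minimal in-process sketch of the idea behind the seen-item filter. It is not how the filter is stored in ElastiCache for Valkey; the bit-array size, the number of hashes, the salted SHA-256 hashing scheme, and the item IDs are all assumptions chosen for illustration. A Bloom filter can report an unseen item as seen (a false positive), but it never misses an item that was actually added.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash positions per item.

    Membership tests can return false positives but never false negatives.
    """

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive each position by salting one hash function with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set every bit position derived from the item."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        """True if all of the item's bit positions are set."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def filter_seen(candidates, seen_filter):
    """Drop candidates the filter reports as already seen, keeping order."""
    return [item for item in candidates if item not in seen_filter]


seen = BloomFilter()
for item_id in ("sku-101", "sku-205"):  # hypothetical recently seen items
    seen.add(item_id)

candidates = ["sku-101", "sku-307", "sku-205", "sku-412"]
remaining = filter_seen(candidates, seen)
```

For this use case, an occasional false positive only means a candidate is hidden unnecessarily, which is a mild cost compared with storing the full interaction history per user.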

Figure 2: Recommender system MLOps with Kubeflow on Amazon Elastic Kubernetes Service (image by author)

Data source

The training data comes from a modified version of the AWS Retail Demo Store interaction generator. The user pool was scaled to 300,000 users, while the product catalog was kept at 2,465 items with their associated images and descriptions. The dataset contains 13 million interactions across 14 days, stored…