Materialized Lake Views in Microsoft Fabric: When Your Medallion Fits in a SELECT Statement

Five surfaces collapsed into one declarative layer

For the longest time, building a medallion architecture in Microsoft Fabric meant stitching together a small orchestra of moving parts: notebooks for the transformations, pipelines for orchestration, schedules for refresh, custom code for data quality checks, and the Monitor Hub for keeping an eye on whether anything actually worked. Every layer worked – until something didn’t, and then you had to figure out which layer broke, why, and which downstream layers got affected along the way. If you’ve ever tried to debug a silver layer that didn’t update because the bronze notebook failed three hours ago, you know exactly what I’m talking about.

Then, at FabCon Atlanta in March 2026, materialized lake views (MLVs) became generally available. And the story they’re telling is simple: what if your entire medallion pipeline could be a few SELECT statements? Let me walk you through the whole thing – what they are, how they work, what changed between preview and GA, and where they fit (and where they don’t) in your architecture.

Materialized Lake View – WHAT?

A materialized lake view is a persisted, automatically refreshed view defined in Spark SQL or PySpark. You write a SELECT query that describes the transformation you want, and Fabric takes care of execution, storage, refresh, dependency tracking, and data quality enforcement. The result is stored as a Delta table in your lakehouse, so downstream consumers such as Power BI Direct Lake, Spark notebooks, and SQL endpoints can query it just like any other Delta table. No special handling, no different syntax. To put it in plain English: an MLV is nothing but a SELECT statement that learned to materialize itself, manage its own dependencies, schedule its own refresh, and check its own data quality.
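
To make "no special handling" concrete, here is a minimal sketch of a downstream query. It assumes the silver.cleaned_order_data view from the example later in this article already exists; the date filter and the aggregation are made up for illustration.

```sql
-- Query the MLV exactly as you would query a regular Delta table.
-- silver.cleaned_order_data is the example MLV defined later in this article.
SELECT category,
       SUM(totalAmount) AS total_sales
FROM silver.cleaned_order_data
WHERE orderDate >= '2026-01-01'   -- illustrative filter, not from the source
GROUP BY category
ORDER BY total_sales DESC;
```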

OK, that’s nice. But what does that actually replace?

That’s a fair question. Before MLVs, building a single bronze-to-silver-to-gold flow looked roughly like this: you’d write a notebook for each transformation, set up a Data Factory pipeline to call them in the right order, configure schedules, build custom validation logic, and then wire up the Monitor Hub to watch for failures. Five different surfaces, five different things to debug when something breaks. With MLVs, all of that collapses into declarative SQL. You describe what you want. Fabric figures out the rest.
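
As a sketch of what "Fabric figures out the rest" can look like, the statement below stacks a gold-layer view on top of a silver-layer one. The names gold.category_sales, total_sales, and order_lines are hypothetical; silver.cleaned_order_data is the example view defined later in this article. The ordering between the two views is not declared anywhere: it is implied by the FROM clause.

```sql
-- Assumption: the gold schema may not exist yet in the lakehouse.
CREATE SCHEMA IF NOT EXISTS gold;

-- Hypothetical gold-layer MLV built on a silver-layer MLV.
-- The dependency on silver.cleaned_order_data comes from the FROM clause,
-- not from a pipeline step.
CREATE MATERIALIZED LAKE VIEW IF NOT EXISTS gold.category_sales
COMMENT "Total sales per product category"
AS
SELECT category,
       SUM(totalAmount) AS total_sales,
       COUNT(*)         AS order_lines
FROM silver.cleaned_order_data
GROUP BY category;
```

Whether a view with aggregations like this one qualifies for incremental refresh depends on the supported-construct rules covered in the refresh section, so treat it as an illustration of dependency tracking rather than of refresh behavior.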

The four stages of an MLV’s life

Every MLV moves through four stages. According to the Microsoft documentation, understanding them is the foundation for everything else:

  1. Create – You write the Spark SQL (or PySpark) that defines the transformation. Fabric stores the definition and materializes the initial result as a Delta table.
  2. Refresh – When source data changes, Fabric chooses the optimal strategy: incremental (process only changes), full (rebuild), or skip (no changes detected).
  3. Query – Any application or tool reads the materialized result. They don’t know – and don’t need to know – that it’s an MLV.
  4. Monitor – Refresh history, execution status, data quality metrics, and lineage are all tracked and visualised natively in Fabric.
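
The stored definition from the Create stage can be inspected from SQL as well as from the Fabric UI. The two SHOW statements below are, to my understanding, part of the MLV Spark SQL reference on Microsoft Learn; check them against the current documentation before relying on them. The silver schema and view name are taken from the example later in this article.

```sql
-- List the materialized lake views that live in the silver schema.
SHOW MATERIALIZED LAKE VIEWS IN silver;

-- Print the CREATE statement Fabric stored for one view.
SHOW CREATE MATERIALIZED LAKE VIEW silver.cleaned_order_data;
```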

Create: the syntax

Here’s the full Spark SQL pseudo-code syntax for creating an MLV, straight from the Microsoft Learn reference:

CREATE [OR REPLACE] MATERIALIZED LAKE VIEW [IF NOT EXISTS] [workspace.lakehouse.schema].MLV_Identifier 
[(CONSTRAINT constraint_name CHECK (condition) [ON MISMATCH DROP | FAIL], ...)] 
[PARTITIONED BY (col1, col2, ...)] 
[COMMENT "description"]
[TBLPROPERTIES ("key1"="val1", ...)]
AS select_statement

A real example – cleaning order data joined from products and orders, with a data quality constraint and partitioning:

CREATE OR REPLACE MATERIALIZED LAKE VIEW silver.cleaned_order_data 
( 
  CONSTRAINT valid_quantity CHECK (quantity > 0) ON MISMATCH DROP 
) 
PARTITIONED BY (category) 
COMMENT "Cleaned order data joined from products and orders"
AS 
SELECT p.productID, p.productName, p.category, o.orderDate, o.quantity, o.totalAmount 
FROM bronze.products p 
INNER JOIN bronze.orders o ON p.productID = o.productID
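
The syntax above allows several constraints, each with its own ON MISMATCH action. The variation below is a sketch: the view name silver.strict_order_data and the has_order_date constraint are hypothetical, and it reuses only columns that appear in the example above. Rows that violate a DROP constraint are removed from the result, while a violation of a FAIL constraint stops the refresh.

```sql
-- Hypothetical view with two constraints and two different actions.
CREATE OR REPLACE MATERIALIZED LAKE VIEW silver.strict_order_data
(
  -- Bad quantities are tolerable: drop the offending rows.
  CONSTRAINT valid_quantity CHECK (quantity > 0) ON MISMATCH DROP,
  -- A missing order date points to a broken load: fail the refresh.
  CONSTRAINT has_order_date CHECK (orderDate IS NOT NULL) ON MISMATCH FAIL
)
AS
SELECT o.productID, o.orderDate, o.quantity, o.totalAmount
FROM bronze.orders o;
```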

Two things worth flagging right away. First, MLV names are case-insensitive (MyView becomes myview). Second, all-uppercase schema names (like MYSCHEMA) aren’t supported, so use mixed case or lowercase. You also need a schema-enabled lakehouse and Fabric Runtime 1.3 or higher. If your lakehouse doesn’t have schemas turned on, MLVs aren’t available; that’s the very first prerequisite.

Refresh: the brain of MLVs

Here’s where MLVs stop being clever and start being smart. When source data changes, Fabric’s optimal refresh engine looks at every MLV in the lineage and asks a series of questions: Did anything actually change? Can I process just the changes? Or do I need to rebuild from scratch? Three possible outcomes:

  1. Skip refresh – source data hasn’t changed. Don’t waste compute. Move on.
  2. Incremental refresh – process only the new or changed rows. Fast, cheap, ideal.
  3. Full refresh – rebuild the whole thing. Slowest path, used when incremental isn’t safe or possible.
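
The three outcomes above are chosen by Fabric, but a full rebuild can also be requested by hand, for example after fixing bad source data. The statement below follows the REFRESH syntax as I understand it from the Microsoft Learn reference; treat it as a sketch and confirm it against the current docs.

```sql
-- Force a full rebuild of one MLV, bypassing the skip/incremental decision.
REFRESH MATERIALIZED LAKE VIEW silver.cleaned_order_data FULL;
```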

But, and this is important, incremental refresh isn’t free. It has prerequisites:

  • The Delta change data feed (CDF) must be enabled on every source table referenced by the MLV (delta.enableChangeDataFeed=true).
  • The source must be a Delta table. Non-Delta sources always get a full refresh.
  • The data must be append-only. If your source has updates or deletes, Fabric falls back to a full refresh.
  • The query must use only supported SQL constructs.
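
The first prerequisite is a table property on each source. A minimal sketch using standard Delta Lake SQL, assuming the bronze.products and bronze.orders tables from the earlier example are the sources:

```sql
-- Turn on the change data feed for every source table the MLV reads from.
ALTER TABLE bronze.products SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
ALTER TABLE bronze.orders   SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Confirm the property is set on one of them.
SHOW TBLPROPERTIES bronze.orders ('delta.enableChangeDataFeed');
```

CDF only records changes made after it is switched on, so commits that landed before the ALTER TABLE are not available in the feed.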

Without CDF enabled, optimal refresh can only choose between skip and full. With CDF on, the incremental path opens up. For append-only workloads, enabling CDF on your source tables has little storage or performance cost, because Delta can derive inserted rows from the existing data files and transaction log instead of writing separate change files, so there’s very little reason not to.
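
One way to check whether a source really is append-only is to look at the change types recorded in its feed. This is a sketch that assumes the Delta table_changes function is available in your Spark runtime and that CDF is already enabled on bronze.orders; the starting version 1 is a placeholder.

```sql
-- Find the commit where CDF was enabled (look for the SET TBLPROPERTIES operation).
DESCRIBE HISTORY bronze.orders;

-- Replace 1 with that version: the feed cannot be read from before CDF was enabled.
SELECT _change_type,
       COUNT(*) AS row_count
FROM table_changes('bronze.orders', 1)
GROUP BY _change_type;
```

If the only _change_type that comes back is insert, the table has behaved as append-only over that range. Any update_preimage, update_postimage, or delete rows mean the append-only prerequisite is not met.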
