ASF Project Spotlight: Apache Iceberg
ASF Project Spotlight: Apache Iceberg
ASF 项目聚焦:Apache Iceberg
Dipankar Mazumdar is the Director of Developer Relations at Cloudera, leading global developer initiatives across lakehouse architecture and AI. He previously held advocacy and engineering roles at Dremio, Onehouse, and Qlik, contributing to open source projects including Apache Iceberg, Apache Hudi, and Apache XTable (incubating) and building communities. His work focuses on the intersection of data engineering and AI. He is the author of Engineering Lakehouses with Open Table Formats and a contributor to Apache Iceberg: The Definitive Guide. Apache Iceberg has quickly become a foundational technology in modern data architectures—but its impact goes far beyond performance and scale. This conversation with Dipankar explores how Iceberg redefined the data lake, and how community, education, and open collaboration fueled its adoption.
Dipankar Mazumdar 是 Cloudera 的开发者关系总监,负责领导湖仓一体架构和人工智能领域的全球开发者计划。他此前曾在 Dremio、Onehouse 和 Qlik 担任技术推广和工程职务,为 Apache Iceberg、Apache Hudi 和 Apache XTable(孵化中)等开源项目做出贡献,并致力于社区建设。他的工作重点是数据工程与人工智能的交叉领域。他是《Engineering Lakehouses with Open Table Formats》一书的作者,也是《Apache Iceberg: The Definitive Guide》的贡献者。Apache Iceberg 已迅速成为现代数据架构中的基础技术,但其影响远不止于性能和规模。本次与 Dipankar 的对话探讨了 Iceberg 如何重新定义数据湖,以及社区、教育和开放协作如何推动了它的普及。
What Is Apache Iceberg and Why It Exists
什么是 Apache Iceberg,它为何存在?
Q: Can you tell us a bit about Apache Iceberg? 问:你能简单介绍一下 Apache Iceberg 吗?
A: Apache Iceberg is a high-performance open table format for huge analytic datasets. It was designed to bring reliability and simplicity to data lakes, allowing multiple engines to safely read and write to the same datasets with strong guarantees. By introducing a table abstraction on top of raw data files, Iceberg helps organizations manage large-scale data with the consistency typically associated with data warehouses while retaining the flexibility of data lakes.
答: Apache Iceberg 是一种用于海量分析数据集的高性能开放表格式。它的设计初衷是为数据湖带来可靠性和简洁性,允许不同的计算引擎在强一致性保证下安全地读写同一数据集。通过在原始数据文件之上引入表抽象层,Iceberg 帮助企业在保持数据湖灵活性的同时,实现通常只有数据仓库才具备的数据一致性管理。
Q: What did the data ecosystem look like before Apache Iceberg, and what gaps did you see early on? 问:在 Apache Iceberg 出现之前,数据生态系统是什么样的?你早期发现了哪些差距?
A: Before Iceberg, most data lakes relied on tables built on technologies like Apache Hive, with Apache Parquet as the underlying storage format. These systems worked well in the Hadoop era, but as workloads diversified and organizations were moving to cloud object stores (like Amazon S3), a number of structural limitations began to surface. Updates were unreliable, partitioning strategies were brittle, and schema evolution was difficult to manage. Metadata handling also became increasingly expensive, especially with large numbers of files, and query performance would degrade over time. At the same time, data warehouses abstracted all of this away behind proprietary systems, so many engineers never had to think about these problems directly. However, they were dealing with significant cost issues and entering a vendor-locked environment. These limitations/issues in both data lakes and warehouses made it clear that a new approach was needed that treated tables as first-class objects rather than just collections of files.
答: 在 Iceberg 出现之前,大多数数据湖依赖于基于 Apache Hive 等技术构建的表,并以 Apache Parquet 作为底层存储格式。这些系统在 Hadoop 时代运行良好,但随着工作负载的多样化以及企业向云对象存储(如 Amazon S3)迁移,许多结构性限制开始显现。更新操作不可靠,分区策略脆弱,模式(Schema)演进难以管理。元数据处理也变得越来越昂贵,特别是在文件数量庞大的情况下,查询性能会随时间推移而下降。与此同时,数据仓库通过专有系统屏蔽了所有这些复杂性,因此许多工程师从未直接考虑过这些问题。然而,他们却面临着巨大的成本问题,并陷入了供应商锁定的困境。数据湖和数据仓库中的这些局限性/问题清楚地表明,我们需要一种新的方法,将表视为“一等公民”对象,而不仅仅是文件的集合。
From Netflix to Apache: How Iceberg Took Shape
从 Netflix 到 Apache:Iceberg 是如何成型的
Q: When was Iceberg started and why? 问:Iceberg 是何时启动的,原因是什么?
A: Iceberg was originally developed at Netflix to address these large-scale data challenges. The team needed a solution that could handle massive datasets reliably while supporting evolving data requirements. Recognizing that these challenges were industry-wide, the project was later open sourced and contributed to The Apache Software Foundation (ASF) in 2018 to foster broader collaboration and adoption.
答: Iceberg 最初是在 Netflix 开发的,旨在解决这些大规模数据挑战。团队需要一个能够可靠地处理海量数据集,同时支持不断变化的数据需求的解决方案。意识到这些挑战是全行业共有的,该项目后来于 2018 年开源并捐赠给 Apache 软件基金会 (ASF),以促进更广泛的协作和采用。
Q: What technology problem is Apache Iceberg solving? 问:Apache Iceberg 解决了什么技术问题?
A: Iceberg addresses several fundamental issues in traditional data lakes:
- Lack of consistency when multiple engines works on the same data
- Complex and brittle partitioning strategies
- Interoperability at the storage layer
- Challenges with schema and partition evolution By rethinking how tables are defined and managed, Iceberg enables scalable, reliable data operations without the overhead and fragility of legacy approaches.
答: Iceberg 解决了传统数据湖中的几个根本性问题:
- 当多个引擎处理同一数据时缺乏一致性
- 复杂且脆弱的分区策略
- 存储层的互操作性问题
- 模式和分区演进带来的挑战 通过重新思考表的定义和管理方式,Iceberg 实现了可扩展、可靠的数据操作,且无需承担传统方法带来的开销和脆弱性。
Q: What were some of the most important design decisions that shaped Iceberg early on? 问:在 Iceberg 早期,哪些最重要的设计决策塑造了它?
A: A few key principles guided Iceberg’s design:
- Treating metadata as a first-class concern for performance and scalability
- Decoupling logical table structure from physical storage layout
- Supporting schema evolution as a core feature, not an afterthought
- Enabling engine-agnostic access to the data These decisions allowed Iceberg to avoid many of the constraints that limited earlier systems.
答: 几个关键原则指导了 Iceberg 的设计:
- 将元数据视为性能和可扩展性的核心关注点
- 将逻辑表结构与物理存储布局解耦
- 将模式演进作为核心功能支持,而非事后补救
- 实现对数据的引擎无关访问 这些决策使 Iceberg 避免了许多限制早期系统的约束。
Real-World Use and Impact
实际应用与影响
Q: Iceberg is known for working across multiple compute engines. Why was that so important from the start? 问:Iceberg 以能够跨多个计算引擎工作而闻名。为什么这一点从一开始就如此重要?
A: Interoperability is essential because modern data ecosystems rely on multiple processing engines. Iceberg was designed to act as a shared table layer, enabling different tools to safely access the same data without tight coupling. This approach gives organizations flexibility and helps prevent vendor lock-in.
答: 互操作性至关重要,因为现代数据生态系统依赖于多种处理引擎。Iceberg 被设计为共享表层,使不同的工具能够安全地访问同一数据,而无需紧密耦合。这种方法为企业提供了灵活性,并有助于防止供应商锁定。
Q: Are there any use cases you would like to tell us about? 问:有哪些用例是你愿意分享的吗?
A: Iceberg is used across industries for:
- Large-scale analytics and reporting
- AI pipelines
- Streaming and batch data processing It enables teams to unify different workloads on a single, reliable data foundation.
答: Iceberg 被广泛应用于各行各业,用于:
- 大规模分析和报表
- AI 流水线
- 流式和批处理数据处理 它使团队能够在单一、可靠的数据基础上统一不同的工作负载。
The Challenge: Explaining a New Layer of the Data Stack
挑战:解释数据栈中的新层级
Q: What made evangelizing Iceberg particularly challenging in those early days? 问:在早期阶段,推广 Iceberg 的最大挑战是什么?
A: The challenge wasn’t just that Iceberg was new – it was that the problem it addressed wasn’t clearly recognized. From the outside, most systems appeared to be working. Data warehouses handled structured workloads, data lakes handled large-scale storage, and teams had already built processes around them. So when we started talking about open table formats and formal specifications, it didn’t feel like solving an urgent problem. Even when issues did exist, they weren’t attributed to the table abstraction itself. Slow jobs, partitioning limitations, schema breakages, or data corruption from concurrent writes were seen as isolated operational problems. Teams would patch them with scripts, conventions, or workarounds rather than questioning the underlying design. That made the conversation harder. We weren’t just introducing a new approach – we were pointing out that a foundational layer people relied on had limitations they hadn’t fully understood yet.
答: 挑战不仅在于 Iceberg 是新的,还在于它所解决的问题并没有被清晰地认识到。从外部看,大多数系统似乎都在正常工作。数据仓库处理结构化工作负载,数据湖处理大规模存储,团队已经围绕它们构建了流程。因此,当我们开始谈论开放表格式和正式规范时,这看起来并不像是在解决一个紧迫的问题。即使确实存在问题,人们也不会将其归咎于表抽象本身。作业缓慢、分区限制、模式损坏或并发写入导致的数据损坏,都被视为孤立的操作问题。团队通常会用脚本、约定或变通方法来修补它们,而不是质疑底层设计。这使得沟通变得更加困难。我们不仅仅是在引入一种新方法,我们还在指出人们所依赖的基础层存在他们尚未完全理解的局限性。