7 Crucial Barriers Between Data Teams and Self-Healing Data Architecture

What data teams need to build with AI to make self-healing data architecture a practical reality.


Introduction

For many data engineers, examples of AI in data engineering revolve around one thing: fixing a pipeline. An engineer opens Claude Code, pastes in some logs, and a pull request appears. Semantics matter here, because when people say “self-healing”, what they mean is “self-managing”. Success with AI is defined not by manual intervention and interaction but by their absence. The dream for data teams is a system in which data pipelines and workflows generally succeed without any human intervention at all.

However, there are barriers between us and this golden future. Agents require context: a pipeline may fail because of a transient error, an upstream schema change, or something entirely uncontrollable, like a human dropping a table. Experience gives engineering teams the know-how to fix these; that is the context agents are missing. A shift in mindset is also needed. The old pattern of “new branch, merge, re-run” is distinctly slow and not agent-y. Changing our patterns so that agents can merge PRs as well would require a large shift in mindset.

Finally, data does not “branch” well. Projects like lakeFS promised to make “Git for data” mainstream, but it is not. I have been writing about zero-copy cloning for years, but it is still not widely used. The distinctions between code and data are not obvious. In this article, we’ll cover 7 barriers between today’s typical data stacks and the nirvana of self-healing, autonomous data pipelines. Let’s dive in!


Barrier 1 | Context and failure recall

Pipelines can fail for a plethora of reasons, and being able to fix pipelines, period, is a requirement for an AI system. We can categorise failures into a few broad types: infrastructure issues, code issues, data issues, and transient or third-party issues.
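To make the four categories concrete, here is a minimal sketch of how an agent might sort a failure into one of them before deciding what to do. Everything in it is an assumption for illustration: the `FailureType` names, the `HINTS` keyword lists, and the idea of matching substrings in a log are all hypothetical, and a real system would need far richer signals.

```python
from enum import Enum
from typing import Optional


class FailureType(Enum):
    INFRASTRUCTURE = "infrastructure"
    CODE = "code"
    DATA = "data"
    TRANSIENT = "transient_or_third_party"


# Hypothetical keyword hints, checked in order. Illustrative only.
HINTS = {
    FailureType.INFRASTRUCTURE: ("oomkilled", "out of memory", "no space left"),
    FailureType.CODE: ("syntaxerror", "nameerror", "typeerror"),
    FailureType.DATA: ("schema mismatch", "column not found", "null constraint"),
    FailureType.TRANSIENT: ("timeout", "connection reset", "rate limit", "503"),
}


def classify_failure(log_text: str) -> Optional[FailureType]:
    """Return the first failure type whose hint appears in the log, else None."""
    lowered = log_text.lower()
    for failure_type, hints in HINTS.items():
        if any(hint in lowered for hint in hints):
            return failure_type
    return None  # Unknown failures are left for a human to triage.
```

Returning `None` for anything unrecognised is deliberate in this sketch: a failure the agent cannot place is exactly the case where missing context matters most.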

Generally, fixing these failures requires knowledge of the system. For example, Acme’s Kubernetes cluster may only be accessible by Mr. Bob, who is the only person with access to Bob’s special access key, hidden in AWS Secrets Manager with a non-standard header. AI doesn’t know about Bob’s key, so it won’t be able to fix the cluster. Similarly, Analyst Sophie may know that the right thing to do at Widgets Incorporated is to simply gloss over the fact that sales are reported in multiple currencies, and to manipulate the numbers to be 10% higher than yesterday’s. AI doesn’t know how to treat the numbers. AI may also not know that to handle failures of the internal API, you simply need to try it again between 2:47am and 3:12am.

These are ridiculous examples, but they illustrate the point: the knowledge needed to fix these different types of errors often exists only in individuals’ heads. It is not enough to speak about “metadata context”. Gathering lineage, logs, code, documentation, and other written-down context is undoubtedly imperative, and AI is actually pretty good at working things out from it. The gap is everything that was never written down. As data folks, we’ve all been in a situation where we (or perhaps someone we’ve spoken to) have thought: “How on earth could I have known that?” At the end of the day, only humans know where the bodies are buried. This entire structure is tech debt and could be broken down with AI.
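One way to picture getting that knowledge out of people’s heads is a machine-readable runbook entry. The sketch below encodes the internal API example from above. `RunbookEntry`, its fields, and the `internal_api` name are hypothetical and invented purely for illustration; they are not part of any existing tool.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional


@dataclass(frozen=True)
class RunbookEntry:
    """Hypothetical record of tribal knowledge an agent could read."""

    system: str
    symptom: str
    remedy: str
    retry_from: Optional[time] = None   # start of the window in which a retry works
    retry_until: Optional[time] = None  # end of that window

    def retry_allowed(self, now: time) -> bool:
        """True if no window is recorded, or if `now` falls inside it."""
        if self.retry_from is None or self.retry_until is None:
            return True
        return self.retry_from <= now <= self.retry_until


# The "retry between 2:47am and 3:12am" rule, written down instead of remembered.
internal_api_rule = RunbookEntry(
    system="internal_api",
    symptom="request fails",
    remedy="retry the request",
    retry_from=time(2, 47),
    retry_until=time(3, 12),
)
```

An agent could call `internal_api_rule.retry_allowed(now)` before retrying. Note that this simple comparison assumes the window does not cross midnight.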


Barrier 2 | Elastic infrastructure

Considering infrastructure issues specifically, I am coining a term: “Elastic Infrastructure”. Elastic Infrastructure does not just scale; it also has an API through which it can be managed. An EC2 instance would not be elastic, as it does not scale beyond a certain point. A Kubernetes cluster on a locked-down machine would not be elastic with respect to the cloud, as there would be no API through which to manage it. The reason this matters is that AI will require access to infrastructure in order to recover from its failures. SaaS providers should relish this opportunity. SaaS providers necessarily take the burden of managing infrastructure away from data teams, for a fee. This is a very AI-friendly approach, but it falls down in respect of Barrier 6, which we will get to.
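The two halves of that definition, “it scales” and “it has an API”, can be sketched as an interface an agent would code against. This is a rough illustration under stated assumptions: `ElasticInfrastructure`, `FakeCluster`, and `recover` are hypothetical names, and `FakeCluster` is an in-memory stand-in rather than a real provider client.

```python
from dataclasses import dataclass
from typing import Protocol


class ElasticInfrastructure(Protocol):
    """Hypothetical contract: the resource scales AND exposes a management API."""

    def status(self) -> str: ...
    def worker_count(self) -> int: ...
    def scale_to(self, workers: int) -> None: ...
    def restart(self) -> None: ...


@dataclass
class FakeCluster:
    """In-memory stand-in for a real provider client, for illustration only."""

    workers: int = 2
    state: str = "failed"

    def status(self) -> str:
        return self.state

    def worker_count(self) -> int:
        return self.workers

    def scale_to(self, workers: int) -> None:
        self.workers = workers

    def restart(self) -> None:
        self.state = "running"


def recover(infra: ElasticInfrastructure, out_of_memory: bool = False) -> str:
    """Restart a failed resource, doubling capacity first if it ran out of memory."""
    if infra.status() == "running":
        return "no action needed"
    if out_of_memory:
        infra.scale_to(infra.worker_count() * 2)
        infra.restart()
        return "scaled and restarted"
    infra.restart()
    return "restarted"
```

A locked-down machine fails this contract because nothing can call `restart` or `scale_to` on it, which is the sense in which it is not elastic here.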


Barrier 3 | Operational Agents and Quality Data

Pete in Finance has overwritten the Supply and Operations Planning Google Sheet for the US again. The international forecasts are broken, and your pipeline is failing. There are 0 rows in us_forecast_dec_v1 and forecasts_agg is stale. AI is telling you the connectors are fine but there was no data. It can’t do anything. What is the solution here? Let’s play a quiz. I’ll give you some options, and you pick the right answer.

Option 1: let AI hallucinate the forecasts
Option 2: let AI hallucinate the forecasts in your data warehouse, and re-run the Google Sheet pipeline later
Option 3: AI tells Pete to upload the damn forecasts!
Option 4: there is a warm pool of rented humans. When this type of pipeline fails, the AI instructs the warm pool to bother Pete in person until he fixes the pipeline himself, by hand

Of course, there is no right answer! None of the options is great; they range from bad to ludicrous. In fact, Option 4 doesn’t really require AI at all, but something called teamwork. Quality data is, as ever, the most important thing for a data engineer. Data teams should ask this question more often when they interview: “How good is your data?” It is such a determinant of quality of life that it is surprising it does not get more of a mention. That is not to say that operational agents have no place. Genuine fat-finger errors, for example, could easily be corrected by an operational agent. Say a new deal is entered as $10m when the correct number is perhaps $1m. An agent with a Salesforce API key could easily amend the data and restart the pipeline.
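Purely as an illustration of Option 3, and not as a claim that it is the right answer, here is a sketch of a guard that refuses to invent rows and instead holds downstream tables and nudges the source owner. `Decision` and `check_source` are hypothetical names; how the row count is obtained and how the message reaches Pete are left out as assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    action: str   # "proceed" or "hold_and_notify"
    message: str


def check_source(table: str, row_count: int, owner: str) -> Decision:
    """Decide what to do with a source table based on its row count.

    An empty source never leads to fabricated data: downstream tables
    are held and the owner is asked to re-upload.
    """
    if row_count > 0:
        return Decision("proceed", f"{table} has {row_count} rows")
    return Decision(
        "hold_and_notify",
        f"{table} is empty; asking {owner} to re-upload before "
        "downstream tables are refreshed",
    )
```

Calling `check_source("us_forecast_dec_v1", 0, "Pete")` yields a `hold_and_notify` decision, leaving forecasts_agg stale rather than wrong.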


Barrier 4 | Git for Data

The previous example raises an important question, which is…