I Thought Data Engineering Was Just Writing Scripts. I Was Wrong.
I tried to make my ETL pipeline production-ready. Three things broke. Each one taught me something scripting alone never could.
After I built my first ETL pipeline, I thought I had a pretty good grip on what data engineering actually was. You extract data from somewhere, you clean it up, you load it somewhere useful. ETL. Simple enough.
For context, I am a data analyst trying to transition into data engineering. I have been documenting that journey publicly, starting with a 12-month self-study roadmap I put together earlier this year. The most recent step in that journey was building my first ETL pipeline from scratch using the GitHub API, which I wrote about here on TDS.
That pipeline worked. It pulled data, cleaned it, and saved it to a CSV. I was happy with it. So I decided to push it further and make it more “production-ready,” as the internet likes to say. What happened next genuinely surprised me. Not because things broke, but because of what the breaking revealed.
The Original Pipeline
The original pipeline was basic, which was fine because that was the point. Extract data from the GitHub API, do a bit of cleaning, save everything to a CSV file. It worked perfectly for what it was: a learning exercise. But a CSV file and a one-time script are not how data engineering works in the real world. I wanted to find out what “real world” actually meant in practice, so I decided to push the pipeline further and see what happened.
(Code snippet omitted for brevity)
Simple, readable, and it works. But the moment you try to run it more than once, or come back to it the next day, the cracks start to show.
Wall One: The Pipeline Had No Memory
The first upgrade was straightforward. Instead of saving to a CSV file, I loaded the data into a SQLite database. SQLite is still just a single file, but it behaves like a real database. You can query it, check what’s already in it, and build on top of it properly. It felt like a small change. It wasn’t.
I ran the pipeline once and got 22 repos. Then I ran it a second time without changing anything and checked the database. Total rows: 44. Unique repos: 22. Duplicates: 22.
Honestly, I didn’t expect it. I knew in theory that it could happen, but I never really thought it would. I’m glad it did, because it was the first time I actually watched my pipeline break. And what it revealed was simple but important: the script had no memory. Every time it ran, it started completely fresh and blindly appended whatever it found. No warning, no error. Just “Pipeline complete” like everything was fine.
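To make the doubling concrete, here is a minimal sketch of a load step that appends blindly, plus the query that exposes the duplicates. The original code isn't shown in this article, so the table name, the columns, and the two sample rows are assumptions for illustration only.

```python
import sqlite3

# Hypothetical stand-in for the rows the extract step returns.
repos = [
    {"id": 1, "name": "repo-a", "stars": 3},
    {"id": 2, "name": "repo-b", "stars": 0},
]

def naive_load(conn, rows):
    """Append every row without checking what is already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS repos (id INTEGER, name TEXT, stars INTEGER)"
    )
    conn.executemany(
        "INSERT INTO repos (id, name, stars) VALUES (:id, :name, :stars)", rows
    )
    conn.commit()

def count_rows(conn):
    """Return (total rows, unique repo ids, duplicate rows)."""
    total, unique = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT id) FROM repos"
    ).fetchone()
    return total, unique, total - unique

conn = sqlite3.connect(":memory:")  # in-memory database, just for the demo
naive_load(conn, repos)
naive_load(conn, repos)             # second run with identical input

total, unique, duplicates = count_rows(conn)
print(f"Total rows: {total}. Unique repos: {unique}. Duplicates: {duplicates}.")
```

With two sample repos, the second run leaves four rows behind, the same pattern as the 22 and 44 above, and nothing in the script complains.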
This is where I came across a concept called idempotency. Idempotency is a fancy word for a simple idea. If something has already happened, doing it again shouldn’t change anything. In the context of a data pipeline, it means that running your pipeline once or running it ten times on the same input should always produce the same result. No extra rows, no duplicates, no silent corruption of your data.
The fix was straightforward in principle. Before inserting anything into the database, the pipeline now checks if that record already exists. If it does, it removes it first, then inserts the fresh version. One small change in thinking, but it completely changes how reliable your pipeline is. And here’s the part that stuck with me: a basic script will never think about this on its own. You have to build it in deliberately. That’s not scripting anymore. That’s engineering.
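The delete-then-insert idea described above could look roughly like this. It is a sketch under assumptions, not the exact code from my pipeline: the `repos` table, its columns, and the `idempotent_load` helper are hypothetical names chosen for the example.

```python
import sqlite3

def idempotent_load(conn, rows):
    """Delete-then-insert keyed on id, so reruns never add extra rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS repos "
        "(id INTEGER PRIMARY KEY, name TEXT, stars INTEGER)"
    )
    with conn:  # one transaction: commits on success, rolls back on error
        for row in rows:
            # Remove any earlier version of this repo, then store the fresh one.
            conn.execute("DELETE FROM repos WHERE id = ?", (row["id"],))
            conn.execute(
                "INSERT INTO repos (id, name, stars) VALUES (:id, :name, :stars)",
                row,
            )

conn = sqlite3.connect(":memory:")  # in-memory database, just for the demo

# Hypothetical sample data: the second run sees an updated star count.
first_run = [{"id": 1, "name": "repo-a", "stars": 3},
             {"id": 2, "name": "repo-b", "stars": 0}]
second_run = [{"id": 1, "name": "repo-a", "stars": 5},
              {"id": 2, "name": "repo-b", "stars": 0}]

idempotent_load(conn, first_run)
idempotent_load(conn, second_run)

row_count = conn.execute("SELECT COUNT(*) FROM repos").fetchone()[0]
stars = conn.execute("SELECT stars FROM repos WHERE id = 1").fetchone()[0]
print(row_count, stars)
```

After both runs the table still holds two rows, and repo 1 carries the newer star count. SQLite's `INSERT OR REPLACE` can express the same thing in a single statement when the table has a primary key; the explicit delete is kept here because it mirrors the check-remove-insert steps in the text.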
Wall Two: The Data Disappeared Overnight
The second wall was less technical and more unsettling. When I closed Colab for the night and came back the next day, there was this uneasy feeling. I had to run everything again from scratch and hope nothing broke, even though everything had worked perfectly the night before. The database I had carefully built was just gone, because it had only ever lived in the temporary Colab environment.
And I remembered that before this project, I had actually struggled to find my original ETL pipeline file. I spent time looking for it before I finally found it. That feeling of almost losing your work stays with you. I knew there had to be a better way. A real pipeline cannot depend on someone being there to rerun it every morning. The data has to live somewhere that survives beyond the session.
The fix here was mounting Google Drive directly inside Colab and pointing the database connection there instead of to the temporary Colab environment. Once Drive is mounted, it is a one-line change: conn = sqlite3.connect('/content/drive/MyDrive/github_repos.db'). Now the database lives in Google Drive. Close the session, restart the runtime, open a new notebook entirely. The data is still there waiting. But this fix revealed something bigger. If persisting data already requires thinking about where things live and how they survive, what happens when you need the pipeline to run automatically every day without you touching it at all?
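For reference, the Drive-backed connection could be wired up along these lines. `drive.mount` is the standard Colab call for attaching Google Drive; the `resolve_db_path` helper and its local fallback are my own additions for illustration, so the same cell also runs outside Colab.

```python
import os
import sqlite3

def resolve_db_path(filename="github_repos.db"):
    """Return a Drive-backed path inside Colab, or a local path elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return filename  # assumption: fall back to the working directory
    drive.mount("/content/drive")  # asks for authorization on first use
    return os.path.join("/content/drive/MyDrive", filename)

db_path = resolve_db_path()
conn = sqlite3.connect(db_path)  # the file now outlives the Colab session
print(f"Database stored at: {db_path}")
conn.close()
```

Inside Colab this resolves to the same /content/drive/MyDrive/github_repos.db path used above. Anywhere else it simply creates the file next to the script.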
Wall Three: Nobody Can Press Run Forever
The third wall was the one that…