CI/CD Pipelines That Actually Work: From “Why Is This Broken?” to “I Feel Like Neo”

The Quest Begins (The “Why”)

Picture this: I’m hunched over my laptop at 2 a.m., coffee cold, staring at a red GitHub Actions badge that stubbornly reads “failed”. I’d just pushed a tiny UI tweak, and the pipeline decided it was the perfect moment to remind me that my node_modules cache was older than the dinosaurs in Jurassic Park. I’d spent the last hour wrestling with a flaky test that only failed on Windows runners, and the CI log was a wall of text that looked like the scrolls in The Lord of the Rings – epic, but completely indecipherable. Honestly, I felt like a rookie Jedi trying to lift an X‑wing with the Force while everyone else was already flying Millennium Falcons. The pain point? Reliability. My team kept shipping hotfixes because the pipeline would randomly drop artifacts, skip steps, or hang forever on a dependency install. I needed a pipeline that didn’t feel like a boss battle every time I hit merge. So I embarked on a quest: find the holy grail of CI/CD that just works across GitHub Actions, GitLab CI, and Jenkins – the three kingdoms I most often serve.

The Revelation (The Insight)

After a weekend of digging through docs, trial and error, and a few “why did I even try this?” moments, the breakthrough came when I stopped treating each CI system as a black box and started thinking of the pipeline as code, the same way we treat application code. The magic? Declarative, version‑controlled, reusable steps that are isolated, cache‑smart, and fail fast. Think of it like Neo seeing the Matrix: once you realize the pipeline is just another piece of software, you can refactor, test, and version it just like your src folder.

The three pillars that made the difference for me:

Explicit caching – cache only what is expensive to recreate (downloaded dependencies, build artifacts) and key it on what actually changes (the lockfile).
Matrix builds with fail‑fast – test multiple environments, but bail out on the first failure to save time.
Self‑contained jobs – each job pulls its own dependencies, uses containers where possible, and leaves no dirty state behind.

When I applied these ideas, the red badge turned green, and the pipeline ran in under 5 minutes instead of the dreaded 20‑minute slog. I felt like I’d just discovered the secret level in Super Mario Bros. – a warp pipe straight to the flagpole.
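
To make the "self‑contained jobs" pillar concrete, here is a minimal sketch of a GitHub Actions job that runs inside a container, so nothing from a previous run can leak in. The node:20 image is my assumption for illustration; swap in whatever base image your project actually targets.

```yaml
# .github/workflows/ci.yml (sketch: a self-contained job)
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    # Assumption: the official node:20 image is a suitable base for this project
    container:
      image: node:20
    steps:
      - uses: actions/checkout@v3
      # The job installs its own dependencies inside the container
      - run: npm ci
      - run: npm test
```

Every run starts from the same image, so the only state a job sees is what the steps themselves create.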

Wielding the Power (Code & Examples)

Below are the “before” (the struggle) and “after” (the victory) snippets for each platform. I’ll point out the traps I fell into so you can dodge them like a pro.

GitHub Actions – The Cache That Actually Caches

Before – a single cache of npm's download directory (~/.npm), under a step name that claimed to cache node_modules; every run still paid for a full npm ci.

# .github/workflows/ci.yml (before)
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Cache node_modules
        uses: actions/cache@v3
        with:
          path: ~/.npm
          key: ${{ runner.os }}-node-${{ hashFiles('package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-node-
      - run: npm ci
      - run: npm test

The trap: caching ~/.npm works, but it only holds npm's download directory. Keying it on package-lock.json means any dependency change produces a new key; the restore-keys prefix then falls back to the most recent older cache, so only the changed packages are downloaded again. Correct, but still slow, because npm ci rebuilds node_modules from that download cache on every run. The real win came when I split the cache into two layers: one for the download cache (~/.npm) and another for the installed node_modules tree.
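
If you want to see which of those cases a run actually hit, the cache action exposes a cache-hit output. The fragment below is a sketch that belongs under a job's steps: list; the step id npm-cache is a name I made up for illustration.

```yaml
# Fragment for a job's steps: list (sketch)
- name: Cache npm download directory
  id: npm-cache  # hypothetical id, only used to read the output below
  uses: actions/cache@v3
  with:
    path: ~/.npm
    key: ${{ runner.os }}-node-${{ hashFiles('package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-node-
- name: Report cache result
  # cache-hit is 'true' only on an exact key match; a restore-keys
  # fallback or a complete miss yields anything other than 'true'
  run: echo "exact key hit = ${{ steps.npm-cache.outputs.cache-hit }}"
```

Seeing a run of non-exact hits after every lockfile change is a quick way to confirm the fallback behavior described above.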

After – split‑cache strategy + matrix with fail‑fast.

# .github/workflows/ci.yml (after)
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [18.x, 20.x]
      fail-fast: true # <-- cancel the other matrix jobs as soon as one fails (also the default)
    steps:
      - uses: actions/checkout@v3
      # Install Node first so the caches below belong to the right version
      - name: Setup Node
        uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
      - name: Cache npm packages (download cache)
        uses: actions/cache@v3
        with:
          path: ~/.npm
          key: ${{ runner.os }}-node-${{ hashFiles('package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-node-
      - name: Cache node_modules (actual)
        id: modules
        uses: actions/cache@v3
        with:
          path: node_modules
          # No restore-keys: on a partial match npm ci would wipe the restored folder anyway
          key: ${{ runner.os }}-node${{ matrix.node-version }}-modules-${{ hashFiles('package-lock.json') }}
      # npm ci always deletes node_modules, so skip it when the exact cache was restored
      - if: steps.modules.outputs.cache-hit != 'true'
        run: npm ci
      - run: npm test

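Why it works: The first cache restores npm's download directory, so packages are not fetched again (fast); the second restores the already‑installed modules (even faster). Because npm ci always deletes node_modules before installing, that second layer only pays off when the install step is skipped on an exact cache hit. The matrix runs two Node versions in parallel, and fail-fast cancels the jobs still in progress as soon as any one of them fails, so we don’t waste runner time on a build that is already red.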
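
The same matrix idea stretches to operating systems, which is where my Windows‑only flaky test would have surfaced early. The sketch below also adds a job timeout so a hung dependency install cannot run forever; the 15‑minute budget is an arbitrary number I picked for illustration, not a recommendation.

```yaml
# .github/workflows/ci.yml (sketch: OS x Node matrix with a time budget)
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ${{ matrix.os }}
    timeout-minutes: 15  # assumption: tune to your own pipeline
    strategy:
      fail-fast: true
      matrix:
        os: [ubuntu-latest, windows-latest]
        node-version: [18.x, 20.x]
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
      - run: npm ci
      - run: npm test
```

This expands to four jobs (two operating systems times two Node versions), and the first failure cancels the rest.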

GitLab CI – Docker‑in‑Docker Done Right

Before – trying to build Docker images inside a shell executor, leading to permission errors and flaky layer caches.

# .gitlab-ci.yml (before)
stages:
  - build
  - test

build_image:
  stage: build
  script:
    - docker build -t myapp:$CI_COMMIT_SHA .
    - docker push myapp:$CI_COMMIT_SHA

run_tests:
  stage: test
  script:
    - docker run --rm myapp:$CI_COMMIT_SHA npm test

The trap: GitLab’s shared runners often run with a restrictive AppArmor profile; Docker‑in‑Docker (DinD) needs privileged mode, and without it you get “operation not permitted”. Plus, each job started from scratch, pulling base images every time.

After – use the Docker executor with privileged mode and cache layers via --cache-from.

# .gitlab-ci.yml (after)
stages:
  - build
  - test

variables:
  DOCKER_DRIVER: overlay2 # faster storage driver
  IMAGE: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

build:
  stage: build
  image: docker:latest
  services:
    - docker:dind # Docker-in-Docker service
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
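  before_script:
    - docker info
    # Log in with the predefined CI variables so pull and push can reach the project registry
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
  script:
    # This commit's tag does not exist yet, so seed the layer cache from the last pushed :latest
    - docker pull $CI_REGISTRY_IMAGE:latest || true
    # BUILDKIT_INLINE_CACHE=1 embeds cache metadata so later builds can reuse these layers
    - docker build --build-arg BUILDKIT_INLINE_CACHE=1 --cache-from $CI_REGISTRY_IMAGE:latest -t $IMAGE -t $CI_REGISTRY_IMAGE:latest .
    - docker push $IMAGE
    - docker push $CI_REGISTRY_IMAGE:latest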

test:
  stage: test
  image: docker:latest
  services:
    - docker:dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
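  script:
    # Each job gets a fresh Docker daemon, so log in again before pulling
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    - docker pull $IMAGE
    - docker run --rm $IMAGE npm test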

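Why it works: Declaring docker:dind as a service gives the job its own Docker daemon to talk to. That daemon only starts if the runner itself has privileged mode enabled, so the setting has to be in place on the runner, not just in the YAML. The --cache-from flag tells Docker to reuse layers from a previously pushed image, which is why the image named in the flag has to exist in the registry already; with a warm cache, a 2‑minute build becomes a 20‑second incremental one. No more “permission denied” surprises.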
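
If the test stage only needs to run commands inside the freshly built image, one possible variation is to skip Docker‑in‑Docker there and use the image as the job image directly. This is a sketch under assumptions of mine: that the image contains a shell, npm, and the test files, and that the app lives in /app (a hypothetical path; use your Dockerfile's actual WORKDIR).

```yaml
# .gitlab-ci.yml (sketch: test job without Docker-in-Docker)
test:
  stage: test
  # Run the job inside the image pushed by the build stage
  image: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  variables:
    GIT_STRATEGY: none  # the code is already baked into the image
  script:
    # The job starts in the runner's build directory, not the image's WORKDIR
    - cd /app  # hypothetical path
    - npm test
```

The trade‑off is that the job no longer needs a privileged daemon, but it does depend on how the image was built, so treat it as an option to evaluate rather than a drop‑in replacement.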
