Catching Goroutine Leaks in Go Tests With goleak

You’ve seen the graph. A Go service that idles at 40 goroutines climbs to 4,000 over three days, then the pod gets OOM-killed and restarts, and the count starts climbing again. Nothing in the test suite failed. go test ./... is green. go vet is quiet. The leak is real, it’s in your code, and none of your tooling looked for it.

Goroutine leaks are the bug your tests are structurally blind to. A test spawns a goroutine, asserts on a return value, and exits. The goroutine it left parked on a channel read is somebody else’s problem — the test process already moved on. Multiply that by every handler and worker in the codebase and you get the graph above.

goleak, from Uber, closes that gap. It snapshots the set of running goroutines at the end of a test run and fails if any unexpected ones are still alive. Setup is one file per package. Here is how leaks happen, how to wire goleak in, and how to follow the stack it hands you back to the exact go func() that leaked.

What a goroutine leak actually is

A goroutine leaks when it blocks forever on an operation that will never complete, and nothing holds a reference that could unblock it. The scheduler parks it. The garbage collector can’t reclaim it, because a parked goroutine is still a live root. It sits there holding its stack, its captured variables, and whatever those point at, until the process dies.

Here’s a leak that looks like reasonable code:

func Fanout(items []int) <-chan int {
    out := make(chan int)
    go func() {
        for _, n := range items {
            out <- n * 2
        }
        close(out)
    }()
    return out
}

The goroutine sends on an unbuffered channel. If the caller reads every value, this is fine. If the caller reads two values and returns early (an error, a break, a context cancel), the goroutine blocks on out <- n*2 with no reader. It never reaches close(out). It never returns. That’s the leak. The test that “covers” this function reads the whole channel, so it never triggers the early-return path. Green test, live bug.
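To make the early-return path concrete, here is a minimal sketch that reproduces it with only the standard library. The caller firstOnly and the helper leakedAfter are hypothetical names invented for this illustration, and counting goroutines with runtime.NumGoroutine is a rough stand-in for a real leak check, not how goleak works.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Fanout is the leaky version from above: the sender has no way out
// if the caller stops receiving.
func Fanout(items []int) <-chan int {
	out := make(chan int)
	go func() {
		for _, n := range items {
			out <- n * 2
		}
		close(out)
	}()
	return out
}

// firstOnly is a hypothetical caller that reads one value and returns early.
func firstOnly(items []int) int {
	return <-Fanout(items)
}

// leakedAfter reports how many extra goroutines exist after calling f.
// The short sleep gives goroutines that are about to exit time to do so.
func leakedAfter(f func()) int {
	before := runtime.NumGoroutine()
	f()
	time.Sleep(50 * time.Millisecond)
	return runtime.NumGoroutine() - before
}

func main() {
	// Early return: the sender stays parked on its second send.
	early := leakedAfter(func() { firstOnly([]int{1, 2, 3}) })
	// Full drain: the sender reaches close(out) and exits.
	drained := leakedAfter(func() {
		for range Fanout([]int{1, 2, 3}) {
		}
	})
	fmt.Println("left behind after early return:", early)
	fmt.Println("left behind after full drain:", drained)
}
```

The early-return call should leave one goroutine behind and the full drain none, which is exactly the difference a test that reads the whole channel never sees.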

Wiring goleak into TestMain

One TestMain catches leaks for the whole package. Add this file once per package:

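package worker

import (
    "testing"

    "go.uber.org/goleak"
)

// TestMain runs every test in the package, then fails the run if any
// unexpected goroutines are still alive.
func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m)
}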

VerifyTestMain runs all the tests in the package, then checks for stray goroutines after they finish. If any test left one parked, the package fails with a stack dump pointing at where the goroutine was spawned. No per-test boilerplate, no assertions to write. One file covers every test in the package.

Install it with: go get go.uber.org/goleak

If you only want to guard one specific test instead of the whole package, defer a call to VerifyNone at the top of that test:

func TestFanoutEarlyReturn(t *testing.T) {
    defer goleak.VerifyNone(t)
    out := Fanout([]int{1, 2, 3, 4, 5})
    <-out // read one value, then walk away
}

That defer runs after the test body, finds the sender still parked on out <- ..., and fails the test. The leak that shipped to production now fails in CI instead.

Reading the leaked-stack output

The value of goleak is in the report, not just the red X. When TestFanoutEarlyReturn fails, you get something close to this:

found unexpected goroutines:
[Goroutine 34 in state chan send, with worker.Fanout.func1 on top of the stack:
worker.Fanout.func1()
    /app/worker/fanout.go:11 +0x5c
created by worker.Fanout in goroutine 6
    /app/worker/fanout.go:9 +0x7d
]

Three parts of the report tell the whole story. Start at the bottom, move to the top frame, then read the header.

  1. created by worker.Fanout … fanout.go:9: this is the spawn site. Line 9 is the go func(). That’s the goroutine’s birth certificate: which function started it and on what line.

  2. worker.Fanout.func1() … fanout.go:11: this is where it’s stuck right now. Line 11 is out <- n * 2. The top-of-stack frame is the exact statement the goroutine is blocked on.

  3. in state chan send: this is why it’s stuck. The scheduler state tells you the class of leak at a glance:

    • chan send / chan receive: blocked on a channel with no peer.
    • select: blocked in a select with no ready case (often a missing ctx.Done() arm).
    • semacquire: waiting on a sync.Mutex or WaitGroup that never releases. Recent Go releases may report a more specific state here, such as sync.Mutex.Lock.
    • IO wait: parked on the network poller, usually a read with no deadline.

Spawn site plus stuck line plus state is the full diagnosis. You go straight to fanout.go:11, see the unbuffered send, and know the fix is to make the sender respect the caller giving up.
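The state names in that report come from the Go runtime, not from goleak, so you can see them yourself with a plain goroutine dump. The sketch below parks two goroutines on purpose and extracts the state from each "goroutine N [state]:" header that runtime.Stack prints. The helpers parkExamples and dumpStates are hypothetical names for this illustration, and the header parsing is an assumption about the dump's text format rather than a stable API.

```go
package main

import (
	"fmt"
	"regexp"
	"runtime"
	"time"
)

// headerRE matches dump headers such as "goroutine 18 [chan receive]:" or
// "goroutine 18 [select, 2 minutes]:" and captures only the state name.
var headerRE = regexp.MustCompile(`(?m)^goroutine \d+ \[([^\],]+)`)

// dumpStates returns the scheduler state of every goroutine in the process.
func dumpStates() []string {
	buf := make([]byte, 1<<20)
	buf = buf[:runtime.Stack(buf, true)] // true = include all goroutines
	var states []string
	for _, m := range headerRE.FindAllSubmatch(buf, -1) {
		states = append(states, string(m[1]))
	}
	return states
}

// parkExamples deliberately leaks two goroutines, one per state of interest.
func parkExamples() {
	never := make(chan int) // nobody ever sends on or closes this channel

	// Parks in state "chan receive".
	go func() { <-never }()

	// Parks in state "select": two cases, neither ever ready.
	go func() {
		select {
		case <-never:
		case <-make(chan struct{}):
		}
	}()

	time.Sleep(50 * time.Millisecond) // let both goroutines reach their block
}

func main() {
	parkExamples()
	fmt.Println(dumpStates())
}
```

The printed list should include chan receive and select alongside the state of the main goroutine. The same dump is what you get from a SIGQUIT or a panic, so the vocabulary carries over to production debugging.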

Fixing the leak

The sender needs a way out when the reader stops. A context plus a select gives it one:

func Fanout(ctx context.Context, items []int) <-chan int {
    out := make(chan int)
    go func() {
        defer close(out)
        for _, n := range items {
            select {
            case out <- n * 2:
            case <-ctx.Done():
                return
            }
        }
    }()
    return out
}

Now the caller cancels the context when it walks away early, the select takes the ctx.Done() arm, and the goroutine returns instead of parking forever. Re-run the test with goleak watching and it passes, because there’s nothing left alive to find.

Handling goroutines that are supposed to outlive tests

Real programs have background goroutines that start once and run for the life of the process: a connection pool’s health checker, a metrics flusher, the goroutine database/sql keeps for opening new connections. goleak flags those as leaks, because from its point of view they are unexpected survivors. That’s what IgnoreTopFunction is for. You tell goleak which known-good goroutines to skip:

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m, goleak.IgnoreTopFunction(
        "database/sql.(*DB).connectionOpener",
    ))
}

The string is the exact top-of-stack function name from the report, in the same form as the worker.Fanout.func1 you learned to read above. Copy it from goleak’s own output into the ignore list. Keep the list short and specific.