singleflight in Go: Collapsing Duplicate Work Under Load

A hot key expires in Redis. In the same millisecond, 5,000 in-flight requests miss the cache and all decide to rebuild it. Every one of them runs the same expensive query against the same row. The database goes from bored to on fire, the query gets slower, more requests pile up behind it, and the cache never gets a chance to refill. That’s a cache stampede. The load isn’t higher than usual. The work is just duplicated 5,000 times over when one call would have served everyone. Go has a small package for exactly this shape of problem: golang.org/x/sync/singleflight.

What singleflight does

singleflight.Group guarantees that for a given key, only one execution of a function runs at a time. Concurrent callers with the same key wait for that single execution and receive its result.

import (
    "context"

    "golang.org/x/sync/singleflight"
)

var group singleflight.Group

func GetUser(ctx context.Context, id string) (*User, error) {
    v, err, _ := group.Do(id, func() (any, error) {
        return loadUserFromDB(ctx, id)
    })
    if err != nil {
        return nil, err
    }
    return v.(*User), nil
}

Do takes a string key and a function. The first caller for a key runs the function. Any other caller that arrives with the same key while that function is still running blocks, then gets handed the same (value, error) when it finishes. One database read serves the whole crowd.
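
To make the mechanism concrete, here is a simplified sketch of the idea using only the standard library. It is not the library's actual code: miniGroup and call are hypothetical names, and the real package also handles panics, DoChan, Forget, and the shared flag.

```go
package main

import (
    "fmt"
    "sync"
)

// call tracks one in-flight execution and its eventual result.
type call struct {
    done chan struct{} // closed once fn has returned
    val  any
    err  error
}

// miniGroup is a simplified stand-in for singleflight.Group (illustration only).
type miniGroup struct {
    mu sync.Mutex
    m  map[string]*call
}

// Do runs fn once per key among overlapping callers. Callers that arrive
// while a flight is open wait for it and receive the same result.
func (g *miniGroup) Do(key string, fn func() (any, error)) (any, error) {
    g.mu.Lock()
    if g.m == nil {
        g.m = make(map[string]*call)
    }
    if c, ok := g.m[key]; ok {
        g.mu.Unlock()
        <-c.done // a flight is already running: wait for it
        return c.val, c.err
    }
    c := &call{done: make(chan struct{})}
    g.m[key] = c
    g.mu.Unlock()

    c.val, c.err = fn() // only the first caller for this key gets here

    g.mu.Lock()
    delete(g.m, key) // drop the key so the next call runs fn again
    g.mu.Unlock()
    close(c.done) // wake the waiters; val and err are set before this
    return c.val, c.err
}

func main() {
    var g miniGroup
    v, err := g.Do("user:42", func() (any, error) { return "payload", nil })
    fmt.Println(v, err) // payload <nil>
}
```

The map entry exists only while fn is running, which is why the collapsing applies to overlapping calls and nothing else.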

The third return value is a bool named shared. It tells you whether this result was handed to more than one caller:

v, err, shared := group.Do(id, fn)
// shared == true means v went to several waiters at once.

Useful for a metric. If shared is true a lot, you know the collapsing is doing real work.

Watching the collapse

Here’s the behavior made visible. A hundred goroutines call the same key inside a 50ms window. The underlying function counts how many times it actually ran.

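package main

import (
    "fmt"
    "sync"
    "sync/atomic"
    "time"

    "golang.org/x/sync/singleflight"
)

func main() {
    var g singleflight.Group
    var calls int64 // number of times fetch actually executed
    fetch := func() (any, error) {
        atomic.AddInt64(&calls, 1)
        time.Sleep(50 * time.Millisecond) // slow work; keeps the flight open
        return "payload", nil
    }

    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            g.Do("user:42", fetch) // result ignored; only the call count matters here
        }()
    }
    wg.Wait()
    // Prints 1, provided every goroutine calls Do before the 50ms sleep ends.
    fmt.Println(atomic.LoadInt64(&calls))
}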

One hundred goroutines, one execution. That is the whole pitch.

One thing to be clear about: singleflight is not a cache. Do deletes the key from its internal map the moment the function returns, so the next call for the same key runs the function again. It only collapses calls that overlap in time. It sits in front of your cache-load path; it doesn’t replace the cache. The pattern is: check the cache, on a miss call group.Do to load, then write the result back to the cache.

The shared-result caveat

The sharp edge is in the word shared. Every waiter gets the exact same value. If that value is a pointer, a slice, or a map, all of them now hold a reference to the same underlying data.

v, _, _ := group.Do(id, func() (any, error) {
    return loadUserFromDB(ctx, id) // returns *User
})
u := v.(*User)
u.LastSeen = time.Now() // every other caller sees this write

That mutation races with every other goroutine that received the same *User. It’s a classic data race, and it will not show up in a quick test where one caller wins. It shows up under load, which is the only time singleflight does anything at all.

Two ways out. Treat the returned value as immutable and never write through the pointer. Or return a value the callers can own, and copy before you hand it back:

func GetUser(ctx context.Context, id string) (User, error) {
    v, err, _ := group.Do(id, func() (any, error) {
        u, err := loadUserFromDB(ctx, id)
        return u, err // *User
    })
    if err != nil {
        return User{}, err
    }
    return *v.(*User), nil // copy out, caller owns it
}

Returning a copy means a caller can mutate its own User without touching anyone else’s, as long as the fields are plain values: the dereference is a shallow copy. Whichever rule you pick, write it down next to the code, because the failure mode is silent.
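
For a User that holds slices, maps, or pointers, the copy has to go deeper. A minimal sketch, assuming a hypothetical User shape (the fields and the Clone method below are invented for illustration) and Go 1.21 or later for the slices and maps packages:

```go
package main

import (
    "fmt"
    "maps"
    "slices"
)

// User is a hypothetical shape; the fields are assumptions for this example.
type User struct {
    Name  string
    Tags  []string
    Prefs map[string]string
}

// Clone returns a copy that shares no mutable state with u.
func (u *User) Clone() User {
    c := *u // copies Name; Tags and Prefs still alias u's data at this point
    c.Tags = slices.Clone(u.Tags)
    c.Prefs = maps.Clone(u.Prefs)
    return c
}

func main() {
    shared := &User{
        Name:  "ada",
        Tags:  []string{"admin"},
        Prefs: map[string]string{"theme": "dark"},
    }

    shallow := *shared
    shallow.Tags[0] = "guest"   // writes through to shared.Tags
    fmt.Println(shared.Tags[0]) // guest

    deep := shared.Clone()
    deep.Tags[0] = "owner"
    deep.Prefs["theme"] = "light"
    fmt.Println(shared.Tags[0], shared.Prefs["theme"]) // guest dark
}
```

In GetUser, that would mean returning v.(*User).Clone() in place of the plain dereference.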

DoChan and context

Do blocks. If a caller wants to give up when its request context is cancelled, use DoChan, which returns a channel instead of blocking:

func GetUser(ctx context.Context, id string) (*User, error) {
    ch := group.DoChan(id, func() (any, error) {
        return loadUserFromDB(ctx, id)
    })
    select {
    case <-ctx.Done():
        return nil, ctx.Err()
    case res := <-ch:
        if res.Err != nil {
            return nil, res.Err
        }
        return res.Val.(*User), nil
    }
}

Now a caller whose context is cancelled returns right away instead of waiting on the shared work. There’s a subtlety worth knowing. The function runs under the context of whichever caller triggered the flight. Late arrivals wait on work that belongs to the first caller. If that first caller’s context has a tight deadline and cancels, the shared load can fail for everyone who joined it.

When the loaded value is meant to be shared across requests, detach the work from any single request’s context. Derive the function’s context from context.WithoutCancel(ctx) (Go 1.21 and later) or from context.Background(), and give it its own timeout either way, because WithoutCancel drops the parent’s deadline along with its cancellation. That way one impatient caller can’t poison the result for the rest.

Forget: don’t glue every caller to one failure

Because Do drops the key as soon as the function returns, a failure never gets cached across sequential calls. The next request after a failed one starts fresh, so most of the time you don’t touch Forget.

Where it earns its place is the in-flight window. Picture a slow load that takes three seconds and then fails. Every request that arrived during those three seconds attached to that one call and shares its error. A transient blip becomes a synchronized failure for a whole batch of users. Forget drops a key from the group so the next caller starts a new execution instead of waiting on the current one.
