Restartable Sequences
Restartable Sequences
May 31st, 2026 @ justine’s web page
The best kept secret at the frontier of system programming right now is the Linux 4.18+ (c. 2018) concept of restartable sequences or rseq for short. They allow you to create thread-safe data structures without locks or atomics which scale to microprocessors with many cores. 目前系统编程领域最不为人知的秘密,莫过于 Linux 4.18+(约 2018 年)引入的“可重启序列”(Restartable Sequences,简称 rseq)。它允许你在不使用锁或原子操作的情况下创建线程安全的数据结构,并能很好地扩展到拥有多核心的微处理器上。
It’s currently only possible to use rseq on Linux using handwritten assembly code. However I believe in the future, all operating systems will be updated to support rseq(), all system programming languages will be redesigned to be able to express restartable sequences, and all data structure libraries will be rewritten to use them. 目前,在 Linux 上使用 rseq 只能通过手写汇编代码来实现。但我相信在未来,所有操作系统都将更新以支持 rseq(),所有系统编程语言都将被重新设计以支持表达可重启序列,且所有数据结构库都将被重写以利用这一特性。
So far the only software I’ve seen using rseq is tcmalloc, jemalloc, glibc, and cosmopolitan. That’s destined to change now that microprocessors with 128 or even 192 cores are becoming inexpensive. 到目前为止,我所见过的使用 rseq 的软件仅有 tcmalloc、jemalloc、glibc 和 cosmopolitan。随着拥有 128 甚至 192 核心的微处理器变得越来越便宜,这种情况注定会改变。
For example, On my $160 Raspberry Pi 5 (which has 4 cores), rseq makes my malloc() implementation 3x faster versus having a dlmalloc mspace assigned to each thread. For most developers, that’s a take it or leave it kind of improvement. 例如,在我 160 美元的树莓派 5(4 核心)上,与为每个线程分配一个 dlmalloc mspace 相比,rseq 使我的 malloc() 实现速度提升了 3 倍。对于大多数开发者来说,这种提升可有可无。
However, On my $4,834 System76 Thelio Astra with Ampere’s 128 core 3GHz Altra CPU, rseq makes cosmopolitan malloc() go 34x faster (compared to sharding ops over an array of mspaces using sched_getcpu()%32). On my $17,628.55 AMD Threadripper Pro 7995WX with 96 cores, rseq makes my malloc() 43x faster (versus using that same sched_getcpu() mutex sharding technique). 然而,在我那台价值 4,834 美元、搭载 Ampere 128 核心 3GHz Altra CPU 的 System76 Thelio Astra 上,rseq 让 cosmopolitan malloc() 的速度提升了 34 倍(对比使用 sched_getcpu()%32 在 mspace 数组上进行分片操作)。在我那台价值 17,628.55 美元、拥有 96 核心的 AMD Threadripper Pro 7995WX 上,rseq 让我的 malloc() 速度提升了 43 倍(对比使用相同的 sched_getcpu() 互斥锁分片技术)。
System programmers who don’t have a workstation like the ones above are going to be left behind like a dinosaur, with no opportunity to pluck the low hanging fruit of 10x performance optimizations. 没有上述这类工作站的系统程序员将会像恐龙一样被时代淘汰,失去获取 10 倍性能优化这一“低垂果实”的机会。
For example, I wouldn’t have been able to pull off the speedups I made to matrix multiplication last year if I hadn’t splurged on a 96 core CPU. It put me in the poor house for a few months (since the cheaper Ampere workstations weren’t available at the time) but was so worth it, since my work received press coverage, it made me famous in the AI community, it helped my project get adopted by 32% of organizations, and even earned me a job offer from Google to work in their Gradient Canopy improving TPU performance for Gemini. 例如,如果我去年没有斥巨资购买 96 核心的 CPU,我就无法实现矩阵乘法的性能提升。这让我拮据了几个月(因为当时还没有更便宜的 Ampere 工作站),但这一切都是值得的:我的工作获得了媒体报道,使我在 AI 社区声名鹊起,帮助我的项目被 32% 的组织采用,甚至让我获得了谷歌的录用通知,去他们的 Gradient Canopy 部门为 Gemini 优化 TPU 性能。
If you do have one of these microprocessors, then restartable sequences are going to be one of the most important tricks you’ll use to exploit its capabilities. This tutorial will show you how they work, and provide you with a concrete example for pushing and popping which can be immediately useful. 如果你确实拥有这些微处理器之一,那么可重启序列将是你挖掘其性能潜力时最重要的技巧之一。本教程将向你展示它们的工作原理,并提供一个可立即投入使用的入栈(push)和出栈(pop)具体示例。
What Problems Do Restartable Sequences Solve?
可重启序列解决了什么问题?
Whenever the Cosmopolitan C runtime creates a thread on a Linux system, it issues an rseq() system call which gives the kernel 32 bytes of TLS memory. Then, for the remainder of that thread’s life, the kernel will update the TLS memory with the CPU number whenever the thread is rescheduled. 每当 Cosmopolitan C 运行时在 Linux 系统上创建一个线程时,它都会发起一个 rseq() 系统调用,向内核提供 32 字节的 TLS(线程本地存储)内存。此后,在该线程的整个生命周期内,每当线程被重新调度时,内核都会用当前的 CPU 编号更新这块 TLS 内存。
I found that to be immediately helpful for improving my sched_getcpu() implementation. Since now it just needs a 1 nanosecond relaxed mov instruction to get the CPU number, whereas before I needed to wait an entire microsecond for the getcpu() system call. 我发现这对于改进我的 sched_getcpu() 实现非常有帮助。现在只需要一条 1 纳秒的 relaxed mov 指令就能获取 CPU 编号,而以前我需要等待整整一微秒来执行 getcpu() 系统调用。
However it gets better. There’s a second field in the rseq TLS memory that allows the thread to send information back to the kernel. Normally the rseq_cs field is NULL, but it can be updated with a pointer specifying a sequence of assembly instructions in your program. 但这还不是全部。rseq TLS 内存中还有第二个字段,允许线程向内核回传信息。通常 rseq_cs 字段为空,但它可以被更新为一个指针,指向程序中的一段汇编指令序列。
Now, whenever the kernel preempts your thread and tries to move it to a different CPU, it’ll notice your rseq_cs is non-null, and will check the program counter (a.k.a. %rip on x86) to see if it’s currently within the specified interval. If that’s the case, then the kernel will force the thread to jump to an abort handler you also specify, which can do things like jump back to the beginning of the function to retry the operation. 现在,每当内核抢占你的线程并试图将其移动到另一个 CPU 时,它会检查你的 rseq_cs 是否非空,并检查程序计数器(x86 上称为 %rip)是否处于指定的区间内。如果是,内核将强制线程跳转到你指定的异常处理程序(abort handler),该程序可以执行诸如跳回函数开头以重试操作等逻辑。
Here’s why we need that. Let’s say you have a GIL like this: 这就是我们需要它的原因。假设你有一个像这样的全局解释器锁(GIL):
static pthread_mutex_t lock;
static struct List *list;
If you’re using that to protect your data structures, then it’s going to go slow on systems with dozens of cores, since only a single thread can hold the lock at any given moment. So you might think, let’s create a lockless list using atomics. That’s pretty simple if we’re only pushing, but if we want to also be able to pop, then we’d need to account for the ABA problem with something like the following: 如果你用它来保护数据结构,那么在拥有数十个核心的系统上,它的运行速度会很慢,因为在任何给定时刻只有一个线程能持有锁。所以你可能会想,让我们用原子操作创建一个无锁链表。如果只是入栈,这很简单;但如果我们还想实现出栈,就需要考虑 ABA 问题,代码可能如下所示:
(Code block omitted for brevity)
The issue is this will likely go just as slow if not slower. The mere act of sharing the same 64-byte region of memory (a.k.a. cacheline) between multiple cores, causes the CPU internally to basically use a mutex, and chances are the CPU’s internal mutexes aren’t as good as the ones you’ve implemented in userspace. 问题在于,这种做法的速度很可能同样慢,甚至更慢。仅仅是在多个核心之间共享同一块 64 字节的内存区域(即缓存行),就会导致 CPU 在内部基本上使用互斥锁,而且 CPU 内部的互斥锁很可能不如你在用户空间实现的那些高效。
So one potentially smarter approach is to shard the data structure, so each CPU gets its own area. 因此,一个更聪明的做法是对数据结构进行分片,让每个 CPU 拥有自己的区域。
static struct {
alignas(64) struct List *list;
} lists[CPU_SETSIZE];
Then all you have to do is index the lists array using sched_getcpu(). The issue is that doesn’t work. We still nee… 然后你只需要使用 sched_getcpu() 对 lists 数组进行索引即可。但问题是,这行不通。我们仍然需要……