Async Rust never left the MVP state
May 4, 2026 | Dion | Embedded software engineer
I’ve previously explained async bloat and some work-arounds for it, but would much prefer to solve the issue at the root, in the compiler. I’ve submitted a Project Goal, and am looking for help to fund the effort.
I love me some async Rust! It’s amazing how we can write executor agnostic code that can run concurrently on huge servers and tiny microcontrollers.
But especially on those tiny microcontrollers we notice that async Rust is far from the zero-cost abstraction we were promised. That’s because every byte of binary size counts and async introduces a lot of bloat. This bloat exists on desktops and servers as well, but it’s much less noticeable when you have substantially more memory and compute available.
I’ve previously explained some work-arounds for this issue, but would much prefer to get to the root of the problem, and work on improving async bloat in the compiler. As such I have submitted a Project Goal.
This is part 2 of my blog series on this topic. See part 1 for the initial exploration of the topic and what you can do when writing async code to avoid some of the bloat. In this second part we’ll dive into the internals and translate the methods of blog 1 into optimizations for the compiler.
What I won’t be talking about is the often-discussed problem of futures becoming bigger than necessary and doing a lot of copying. People are already aware of that. In fact, there is an open PR that tackles part of it: https://github.com/rust-lang/rust/pull/135527
Anatomy of a generated future
We’re going to be looking at this code:
fn foo() -> impl Future<Output = i32> { async { 5 } }
fn bar() -> impl Future<Output = i32> { async { foo().await + foo().await } }
godbolt
We’re using the desugared syntax for futures because it’s easier to see what’s happening.
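For reference, a quick sketch of that equivalence: `async fn` is sugar for a function returning `impl Future`, and both spellings compile to the same kind of state machine (the function names here are mine, for illustration):

```rust
use std::future::Future;

// The sugared form: what you'd normally write.
async fn foo_sugared() -> i32 {
    5
}

// The desugared form used in this post: same state machine underneath.
fn foo_desugared() -> impl Future<Output = i32> {
    async { 5 }
}
```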
So what does the bar future look like?
There are two await points, so the state machine must have at least two states, right?
Well, yes. But there’s more.
Luckily we can ask the compiler to dump MIR for us at various passes. An interesting pass is the coroutine_resume pass. This is the last async-specific MIR pass. Why is this important? Well, async is a language feature that still exists in MIR, but not in LLVM IR. So the transformation of async to state machine happens as a MIR pass.
The bar function generates 360 lines of MIR. Pretty crazy, right? Although this gets optimized somewhat later on, the non-async version uses only 23 lines for this.
The compiler also outputs the CoroutineLayout. It’s basically an enum with these states (comments my own):
variant_fields: {
Unresumed(0): [], // Starting state / 初始状态
Returned (1): [],
Panicked (2): [],
Suspend0 (3): [_s1], // At await point 1, _s1 = the foo future
Suspend1 (4): [_s0, _s2], // At await point 2, _s0 = result of _s1, _s2 = the second foo future
},
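Written out by hand, that layout corresponds to an enum roughly like the following. This is a sketch: the names are illustrative, `F` stands in for the opaque future type returned by `foo()`, and the real compiler additionally overlaps variant fields where it can.

```rust
use std::future::Future;

// Illustrative hand-written equivalent of the CoroutineLayout above.
enum BarState<F: Future<Output = i32>> {
    Unresumed,                   // never polled yet
    Returned,                    // done; polling again panics
    Panicked,                    // poisoned by a panic during poll
    Suspend0 { s1: F },          // parked at the first await, holding foo's future
    Suspend1 { s0: i32, s2: F }, // first result saved, parked at the second await
}
```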
So what are Returned and Panicked?
Well, Future::poll is a safe function. Calling it must not cause any UB, even when the future is done. So after Suspend1, the future returns Ready and moves to the Returned state. If it is polled again in that state, the poll function will panic.
The Panicked state exists so that once an async fn has panicked, and the panic was caught with the catch-unwind mechanism, the future can’t be polled anymore. Polling a future in the Panicked state will panic. If this mechanism wasn’t there, we could poll the future again after a panic. But the future may be in an incomplete state, and that could cause UB. This mechanism is very similar to mutex poisoning.
(I’m 90% sure I’m correct about the Panicked state, but I can’t really find any docs that actually describe this.)
Cool, this seems reasonable.
Why panic?
But is it reasonable? A future in the Returned state will panic when polled. But it doesn’t have to. The only thing we must not do is cause UB.
Panics are relatively expensive. They introduce a path with a side-effect that’s not easily optimized out. What if instead, we just return Pending again? Nothing unsafe going on, so we fulfill the contract of the Future type.
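On a hand-rolled future, the proposed change amounts to swapping the panic arm for one that returns Pending. A minimal sketch of my proposal, not what the compiler emits today (the type and its names are mine):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

enum Lazy5 {
    Unresumed,
    Returned,
}

impl Future for Lazy5 {
    type Output = i32;

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<i32> {
        match *self {
            Lazy5::Unresumed => {
                *self = Lazy5::Returned;
                Poll::Ready(5)
            }
            // Where the compiler today emits a "resumed after completion"
            // panic, we simply stay Pending: still safe, but no panic
            // machinery needs to be generated for this path.
            Lazy5::Returned => Poll::Pending,
        }
    }
}
```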
I’ve hacked this into the compiler to try it out and saw a 2%-5% reduction in binary size for async embedded firmware.
So I propose this should be a switch, just like overflow-checks = false is for integer overflow. In debug builds it would still panic so that wrong behavior is immediately visible, but in release builds we get smaller futures.
Similarly, when panic=abort is used, we might be able to get rid of the Panicked state altogether. I want to look into the repercussions of that.
Always a state machine
We’ve looked at bar, but not yet at foo.
fn foo() -> impl Future<Output = i32> { async { 5 } }
Let’s implement it manually, to see what the optimal solution would be.
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

struct FooFut;

impl Future for FooFut {
    type Output = i32;

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
        Poll::Ready(5)
    }
}
Easy right? We don’t need any state. We just return the number.
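To confirm, here the stateless future is polled directly with a no-op waker. The harness and helper name are mine, and FooFut is repeated so the snippet stands alone:

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::task::{Context, Poll, Waker};

// The hand-written future from above, repeated for a standalone snippet.
struct FooFut;
impl Future for FooFut {
    type Output = i32;
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
        Poll::Ready(5)
    }
}

// Poll FooFut twice with a no-op waker. Because the future is stateless,
// both polls return Ready(5): no completion state, no panic path at all.
fn poll_twice() -> (Poll<i32>, Poll<i32>) {
    let mut cx = Context::from_waker(Waker::noop());
    let mut fut = pin!(FooFut);
    (fut.as_mut().poll(&mut cx), fut.as_mut().poll(&mut cx))
}
```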
Let’s see what the generated MIR is for the version the compiler gives us:
// MIR for `foo::{closure#0}`
0 coroutine_resume /* coroutine_layout = CoroutineLayout {
field_tys: {},
variant_fields: {
Unresumed(0): [],
Returned (1): [],
Panicked (2): [],
},
storage_conflicts: BitMatrix(0x0) {},
} */
fn foo::{closure#0}(_1: Pin<&mut {async block@src\main.rs:5:5: 5:10}>, _2: &mut Context<'_>) -> Poll<i32> {
debug _task_context => _2;
let mut _0: core::task::Poll<i32>;
let mut _3: i32;
let mut _4: u32;
let mut _5: &mut {async block@src\main.rs:5:5: 5:10};
bb0: {
_5 = copy (_1.0: &mut {async block@src\main.rs:5:5: 5:10});
_4 = discriminant((*_5));
switchInt(move _4) -> [0: bb1, 1: bb4, otherwise: bb5];
}
bb1: {
_3 = const 5_i32;
goto -> bb3;
}
bb2: {
_0 = Poll::<i32>::Ready(move _3);
discriminant((*_5)) = 1;
return;
}
bb3: {
goto -> bb2;
}
bb4: {
assert(const false, "`async fn` resumed after completion") -> [success: bb4, unwind unreachable];
}
bb5: {
unreachable;
}
}
Yikes! That’s a lot of code!
Notice… 请注意……