WATaBoy: JIT-ing Game Boy Instructions to Wasm Beats a Native Interpreter

Background

This text assumes the reader is familiar with the concept of just-in-time compilation. Dolphin isn’t on iOS, because you can’t do JIT compilation on iOS. That’s a quick summary of OatmealDome’s blog post “Why Dolphin Isn’t Coming to the App Store”. Ever since reading that, I’ve wondered what it would take to get a CPU-bound emulator like Dolphin working on iOS. Do we just… have to wait a few years for iPhone CPUs to get fast enough to run Dolphin with an interpreter?

Well, Apple has one exception to its JIT restrictions: web browsers. JavaScriptCore, WebKit’s JS engine, uses JIT compilation for its higher-performance tiers. So, if a JS function is called enough times, eventually it’ll be optimised and compiled into native machine code. The same is true for WebAssembly. So, what if we just piggyback off of this? Instead of generating native machine code directly, we could just generate Wasm bytecode, which will eventually be compiled to native machine code by the web browser.

After reading Andy Wingo’s blog post “just-in-time code generation within webassembly”, I knew such a thing would be possible. In fact, a handful of projects already use this technique, namely The Jiterpreter and v86, but at the time of writing, no emulators for game consoles have used it, and nobody has compared the performance to an interpreter running natively to see if it’s faster. So, for my undergraduate final-year project, I decided I’d build a Game Boy emulator, first using an interpreter, and then using a JIT-to-Wasm. This project primarily serves as a proof of concept and benchmark to compare the performance of each approach.

For the rest of this blog post, I’ll call this a “JIT-to-Wasm” instead of a “Wasm JIT” to avoid confusion with what the JS engine itself does (recompile Wasm to machine code). Anyone reading this who knows a bit about emulation just rolled their eyes, because how the hell is a Game Boy emulator going to benefit from JIT compilation? Luckily, GameRoy’s blog post describes exactly how it’s possible while remaining cycle-accurate: predict when interrupts are going to occur; whenever a JIT block might be interrupted, fall back to an interpreter; and lazily evaluate any non-CPU Game Boy components accessed via MMIO.
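
To make the “lazily evaluate” part concrete, here is a minimal sketch of one component handled that way: the divider (DIV) register, which on the original Game Boy is the upper 8 bits of a 16-bit counter that ticks once per CPU clock cycle. This is not code from GameRoy or WATaBoy; the `LazyDiv` type and its methods are hypothetical names made up for illustration, and it assumes the caller passes in a monotonically increasing cycle count. Instead of ticking the counter every cycle, it only catches up when the register is actually touched.

```rust
/// Hypothetical lazily-evaluated DIV register (illustration only).
pub struct LazyDiv {
    /// Value of the internal 16-bit counter as of `synced_at`.
    counter: u16,
    /// CPU clock cycle at which `counter` was last brought up to date.
    synced_at: u64,
}

impl LazyDiv {
    pub fn new() -> Self {
        LazyDiv { counter: 0, synced_at: 0 }
    }

    /// Catches the counter up to `now` in one step instead of ticking per cycle.
    /// Assumes `now` never goes backwards.
    fn sync(&mut self, now: u64) {
        let elapsed = now - self.synced_at;
        // The counter is 16 bits wide, so only the elapsed time modulo 65536 matters.
        self.counter = self.counter.wrapping_add(elapsed as u16);
        self.synced_at = now;
    }

    /// MMIO read: DIV is the upper byte of the counter.
    pub fn read(&mut self, now: u64) -> u8 {
        self.sync(now);
        (self.counter >> 8) as u8
    }

    /// MMIO write: any write resets the whole counter to zero.
    pub fn write(&mut self, now: u64) {
        self.sync(now);
        self.counter = 0;
    }
}
```

The point of the design is that a JIT block which never reads or writes DIV pays nothing for it; the cost is only incurred at the moment of an MMIO access.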

GameRoy’s JIT only targets x86, but nearly all of its optimisation techniques still apply to our JIT-to-Wasm. Definitely check it out if you’re interested in the nitty-gritty details of the Game Boy emulation side of things; it was a huge inspiration. Still, a Game Boy emulator doesn’t benefit from JIT compilation as much as, say, an emulator for a sixth-gen console would. But it was much faster to make, and actually fit within the scope of my final-year project.

Implementation

Now, to narrow the scope of this blog post, I’ll take you through the most broadly applicable part of WATaBoy that I couldn’t find a guide for anywhere else: Wasm codegen and late-linking from within Rust. A lot makes WATaBoy interesting, specifically from a Game Boy emulation perspective (e.g., SIMD tile rendering), but those implementation details deserve separate write-ups (you can also just read WATaBoy’s source, of course). If you aren’t interested, skip to the results.

Normally we’d reach for tools like wasm-bindgen and wasm-pack to generate glue code between Rust and JavaScript. But those tools cause some ergonomics issues when working with Wasm at a low level. Instead, I use an approach similar to the one described in “Rust to WebAssembly the hard way”. This just means we’ll pass data across the Rust-JS boundary via the C ABI, using pointers and buffer lengths instead of JavaScript objects. Just a heads up: you’ll need Nightly Rust, because we’ll use a tiny bit of inline Wasm later. So run: rustup default nightly. To switch back, run the same command with ‘stable’ in place of ‘nightly’.
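
As a rough illustration of what “pointers and buffer lengths” means in practice, here is a minimal sketch of a C ABI surface. The function names (`alloc`, `dealloc`, `sum_bytes`) are hypothetical and not taken from WATaBoy. The assumed flow is that the host asks Rust for a buffer, writes bytes into linear memory at the returned pointer, and then passes the pointer and length back. In a real Wasm build you would also mark each function with the `no_mangle` attribute so its name survives as an export; it is left out here so the sketch compiles unchanged on any edition.

```rust
/// Allocates `len` zeroed bytes and hands ownership to the caller as a raw pointer.
/// The caller must later pass the same pointer and length to `dealloc`.
pub extern "C" fn alloc(len: usize) -> *mut u8 {
    let mut buf = vec![0u8; len].into_boxed_slice();
    let ptr = buf.as_mut_ptr();
    // Leak the box on purpose: the host now owns these bytes.
    std::mem::forget(buf);
    ptr
}

/// Frees a buffer previously returned by `alloc`.
///
/// Safety: `ptr` and `len` must come from a single earlier `alloc(len)` call,
/// and the buffer must not be used afterwards.
pub unsafe extern "C" fn dealloc(ptr: *mut u8, len: usize) {
    unsafe {
        drop(Box::from_raw(std::ptr::slice_from_raw_parts_mut(ptr, len)));
    }
}

/// Example of consuming host-provided data: sums `len` bytes starting at `ptr`.
///
/// Safety: `ptr` must point to at least `len` readable bytes.
pub unsafe extern "C" fn sum_bytes(ptr: *const u8, len: usize) -> u32 {
    let bytes = unsafe { std::slice::from_raw_parts(ptr, len) };
    bytes
        .iter()
        .fold(0u32, |acc, &b| acc.wrapping_add(u32::from(b)))
}
```

On the JS side, the pointer is just an offset into the module’s linear memory, so no JavaScript objects ever cross the boundary.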

Create a new library: cargo new --lib jit-to-wasm. Hey look, we’ve already got some code here: pub fn add(left: u64, right: u64) -> u64 { left + right }. For our simple example, let’s try producing some Wasm bytecode at runtime that does the same thing.

Wasm code generation

The wasm-encoder crate will be our only dependency. With it, we can emit the bytes for Wasm instructions using a sort of builder pattern. It wasn’t designed for our JIT use case, so there are some ergonomics issues and a tiny bit of boilerplate, but it definitely beats writing an array of raw bytes by hand. :)
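
For a sense of what the crate saves us from, here is a sketch of the “raw bytes by hand” alternative, using only the standard library. It does not use wasm-encoder’s API, and the helper names (`leb128`, `section`, `encode_add_module`) are made up for this example. It builds a complete Wasm module exporting a single function `add` that takes two i64 values and returns their sum, mirroring the `add` function Cargo generated for us.

```rust
/// Appends `value` in unsigned LEB128, the variable-length integer format Wasm uses.
fn leb128(out: &mut Vec<u8>, mut value: u32) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        // More bytes follow, so set the continuation bit.
        out.push(byte | 0x80);
    }
}

/// Appends one section: id byte, content size, then the content itself.
fn section(out: &mut Vec<u8>, id: u8, contents: &[u8]) {
    out.push(id);
    leb128(out, contents.len() as u32);
    out.extend_from_slice(contents);
}

/// Hand-encodes a module equivalent to `fn add(a: i64, b: i64) -> i64 { a + b }`.
pub fn encode_add_module() -> Vec<u8> {
    let mut module = Vec::new();
    module.extend_from_slice(b"\0asm"); // magic number
    module.extend_from_slice(&[1, 0, 0, 0]); // binary format version 1

    // Type section (id 1): one function type, (i64, i64) -> i64.
    section(&mut module, 1, &[0x01, 0x60, 0x02, 0x7e, 0x7e, 0x01, 0x7e]);
    // Function section (id 3): one function, using type index 0.
    section(&mut module, 3, &[0x01, 0x00]);
    // Export section (id 7): export function index 0 under the name "add".
    section(&mut module, 7, &[0x01, 0x03, b'a', b'd', b'd', 0x00, 0x00]);

    // Function body: no locals, then local.get 0, local.get 1, i64.add, end.
    let body = [0x00, 0x20, 0x00, 0x20, 0x01, 0x7c, 0x0b];
    let mut code = vec![0x01]; // one function body follows
    leb128(&mut code, body.len() as u32);
    code.extend_from_slice(&body);
    // Code section (id 10).
    section(&mut module, 10, &code);

    module
}
```

Every size prefix and index here has to be kept consistent by hand, which is exactly the bookkeeping a builder-style encoder takes care of.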
