The time the x86 emulator team found code so bad that they fixed it during emulation
During an exchange of war stories, a colleague of mine told one from back in the days when Windows included a processor emulator for x86-32 on systems that natively ran some other processor. (This has happened many times. And no, I don’t know which processor this particular story applied to.)
This particular emulator employed binary translation, generating native code to perform the equivalent operations of the original x86-32 code. This offered a significant performance improvement over emulation via interpreter. You can imagine that x86-32 is just a bytecode, and the emulator is a JIT compiler.
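To make the "bytecode plus JIT compiler" analogy concrete, here is a minimal sketch in Python. The three-instruction guest ISA is invented for illustration (it is not x86-32), and nothing here reflects how the actual Windows emulator was built. `interpret` decodes every instruction each time it runs, while `translate` decodes once and emits host code that can be called repeatedly.

```python
# Toy guest ISA, invented for this sketch (not real x86-32):
#   ("movi", reg, imm)   reg = imm
#   ("addi", reg, imm)   reg += imm
#   ("mul",  dst, src)   dst *= src

def interpret(program, regs):
    """Interpreter: decode and dispatch every instruction on every run."""
    for op, a, b in program:
        if op == "movi":
            regs[a] = b
        elif op == "addi":
            regs[a] += b
        elif op == "mul":
            regs[a] *= regs[b]
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return regs

def translate(program):
    """Translator: decode once, emit host (Python) source, return a callable."""
    lines = ["def translated(regs):"]
    for op, a, b in program:
        if op == "movi":
            lines.append(f"    regs[{a!r}] = {b!r}")
        elif op == "addi":
            lines.append(f"    regs[{a!r}] += {b!r}")
        elif op == "mul":
            lines.append(f"    regs[{a!r}] *= regs[{b!r}]")
        else:
            raise ValueError(f"unknown opcode {op!r}")
    lines.append("    return regs")
    namespace = {}
    exec("\n".join(lines), namespace)  # "native code" here is just Python
    return namespace["translated"]

program = [
    ("movi", "eax", 6),
    ("addi", "eax", 1),
    ("movi", "ecx", 6),
    ("mul", "eax", "ecx"),
]

native = translate(program)           # pay the decode cost once
result_interp = interpret(program, {})
result_native = native({})            # no per-instruction dispatch
```

Both paths compute the same register state; the translated version simply skips the decode-and-dispatch work on every subsequent run, which is where the speedup over an interpreter comes from.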
Anyway, my colleague found that there was one program that needed to allocate around 64KB of memory on the stack and initialize it. The standard way of doing this is to perform a stack probe to ensure that 64KB of memory is available, subtract 65536 from the stack pointer, and then initialize the memory in a small, tight loop.
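The three steps of the standard approach can be modeled in a few lines of Python. This is a toy model under stated assumptions: `ToyStack`, the 4 KB page size, and the `limit` field are all hypothetical, and a real stack probe works by touching memory rather than by comparing addresses.

```python
PAGE_SIZE = 4096  # assumed page size for this toy model

class ToyStack:
    """Toy downward-growing stack; addresses are indexes into a bytearray."""

    def __init__(self, top, limit):
        self.mem = bytearray(b"\xcc" * top)  # 0xCC marks uninitialized bytes
        self.sp = top        # stack pointer starts at the top
        self.limit = limit   # lowest address the stack may reach

    def probe(self, size):
        """Step 1: walk down page by page and fail before sp is moved."""
        target = self.sp - size
        addr = self.sp
        while addr > target:
            addr = max(addr - PAGE_SIZE, target)
            if addr < self.limit:
                raise MemoryError("stack overflow during probe")

    def alloc_and_init(self, size, value=0):
        self.probe(size)               # step 1: stack probe
        self.sp -= size                # step 2: subtract from the stack pointer
        for i in range(size):          # step 3: small, tight initialization loop
            self.mem[self.sp + i] = value
        return self.sp

stack = ToyStack(top=0x20000, limit=0x1000)
buf = stack.alloc_and_init(65536)
```

The point of the model is the shape of the code: the initialization is a handful of instructions executed 65,536 times, not 65,536 separate instructions.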
But using a loop to initialize the memory was too mundane for whatever compiler was used to compile this code. Instead of generating a loop to initialize each byte of the buffer, the compiler “optimized” the code by unrolling the loop into 65,536 individual “write byte to memory” instructions, each 4 bytes long.
All in all, it took this program 256 kilobytes of code to initialize 64 kilobytes of data. This offended the team so much that they added special code to the translator to detect this horrible function and replace it with the equivalent tight loop.
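The story does not say how the translator recognized the function, so the following Python sketch is only one plausible shape for such a check. `StoreByte`, `FillLoop`, `MIN_RUN`, and `collapse_store_runs` are hypothetical names, and the sketch assumes the stores write the same value to consecutive ascending offsets. It also reproduces the size arithmetic: 65,536 instructions at 4 bytes each is 262,144 bytes, or 256 KB.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoreByte:
    """Hypothetical decoded guest instruction: mem[base + offset] = value."""
    offset: int
    value: int
    length: int = 4  # encoded size in bytes, as given in the story

@dataclass(frozen=True)
class FillLoop:
    """Hypothetical translator IR node: fill `count` bytes from `start`."""
    start: int
    count: int
    value: int

MIN_RUN = 16  # arbitrary threshold; shorter runs are left alone

def collapse_store_runs(instructions):
    """Replace long runs of consecutive same-value byte stores with one loop."""
    out = []
    i = 0
    while i < len(instructions):
        first = instructions[i]
        j = i + 1
        if isinstance(first, StoreByte):
            # Extend the run while offsets ascend by one and the value matches.
            while (j < len(instructions)
                   and isinstance(instructions[j], StoreByte)
                   and instructions[j].value == first.value
                   and instructions[j].offset == instructions[j - 1].offset + 1):
                j += 1
        if isinstance(first, StoreByte) and j - i >= MIN_RUN:
            out.append(FillLoop(first.offset, j - i, first.value))
        else:
            out.extend(instructions[i:j])
        i = j
    return out

unrolled = [StoreByte(offset=k, value=0) for k in range(65536)]
code_bytes = sum(ins.length for ins in unrolled)  # 65536 * 4
collapsed = collapse_store_runs(unrolled)
```

After the pass, the 65,536 stores become a single `FillLoop` node, which the translator could then lower to a tight native loop. Short runs and unrelated instructions pass through unchanged.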
We did this a lot in the Xbox emulator on Xbox 360. Every Ubisoft Xbox game scanned the entire DVD twice to figure out its localization; some games had multithreading or memory management bugs; no games took advantage of WriteScatter/ReadGather; and there were so many other obvious inefficiencies and bugs. Eventually, we got it so games loaded faster in the emulator than on the original console!
So nothing has changed in terms of game optimization 😄. Even games on medium settings are bringing top-of-the-line GPUs to their knees. That being said, we greatly appreciate the efforts you all on the emulation team are making. 🙂
Technical Context
Alpha: DIGITAL FX!32 (Windows NT 4.0, ~1996–97). Ran 32-bit x86 Win32 apps on DEC Alpha NT. It was a DEC product bundled alongside Windows, and it was profile-directed: it emulated first, logged an execution profile, then translated to native Alpha code in a background pass.
Itanium: IA-32 Execution Layer (Windows Server 2003 SP1 / XP 64-bit for Itanium, ~2003–06). Intel’s software dynamic binary translator that shipped with Itanium-based operating systems. It converted IA-32 instructions into Itanium instructions via dynamic translation, replacing the slow on-chip hardware x86 emulation. This is true JIT-style two-phase translation built into the OS.
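The profile-directed flow described for FX!32 (emulate first, log a profile, translate later in a background pass) can be sketched as follows. This is a toy model, not FX!32's design: `ProfileDirectedRunner`, the two-operation guest language, and the `min_count` parameter are all assumptions made for illustration.

```python
from collections import Counter

def emulate(ops, x):
    """Slow path: interpret toy ops of the form ("add", n) or ("mul", n)."""
    for op, n in ops:
        x = x + n if op == "add" else x * n
    return x

def translate(ops):
    """Offline pass: turn the toy ops into a single host expression."""
    expr = "x"
    for op, n in ops:
        expr = f"({expr} {'+' if op == 'add' else '*'} {n})"
    return eval(f"lambda x: {expr}")

class ProfileDirectedRunner:
    def __init__(self):
        self.profile = Counter()  # routine name -> times it was emulated
        self.native = {}          # routine name -> translated callable

    def call(self, name, ops, x):
        if name in self.native:           # translated earlier: run natively
            return self.native[name](x)
        self.profile[name] += 1           # otherwise log it and emulate
        return emulate(ops, x)

    def background_pass(self, routines, min_count=1):
        """Translate every routine the profile says was actually executed."""
        for name, count in self.profile.items():
            if count >= min_count and name not in self.native:
                self.native[name] = translate(routines[name])

routines = {"f": [("add", 2), ("mul", 3)]}
runner = ProfileDirectedRunner()
first = runner.call("f", routines["f"], 5)   # emulated and profiled
runner.background_pass(routines)             # translation happens later
second = runner.call("f", routines["f"], 5)  # now served by translated code
```

The contrast with a JIT-style translator is in when translation happens: here the first run always pays the emulation cost, and only later runs benefit from native code.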
Why did older versions of Windows include x86-32 emulation? NT was designed from the start with portability in mind. It was developed first on a non-x86 chip and then ported to the 80386. Windows NT historically supported a variety of non-x86 architectures (Intel i860, DEC Alpha, MIPS, PowerPC, etc.), while consumer Windows (3.11, 95, 98) was x86-32 only. NT included x86-32 emulation so that you could run binaries built for consumer Windows on your expensive non-x86 Windows NT workstation.