mimalloc: A new, high-performance, scalable memory allocator for the modern era

mimalloc: A new, high-performance, scalable memory allocator for the modern era

mimalloc:面向现代时代的高性能、可扩展内存分配器

At a glance Today’s critical services and applications are often highly concurrent, using hundreds of threads. They also operate at large memory scales, frequently hundreds of gigabytes, especially when using large language models. mimalloc is an open-source, modern, scalable memory allocator that is a drop-in replacement for malloc and free. It is relatively small (~12K lines), with clear internal data structures, and is easy to build and integrate into other projects. It provides bounded worst-case allocation times (up to OS primitives), bounded space overhead, low internal fragmentation, and minimal contention by relying almost exclusively on atomic operations. mimalloc is available on GitHub and has over 12K stars.

概览:当今的关键服务和应用程序通常具有高度并发性,使用数百个线程。它们还在大规模内存环境下运行,经常达到数百 GB,特别是在使用大型语言模型时。mimalloc 是一个开源、现代且可扩展的内存分配器,可作为 malloc 和 free 的直接替代品。它相对精简(约 1.2 万行代码),具有清晰的内部数据结构,且易于构建并集成到其他项目中。它提供了有界的分配最坏情况时间(取决于操作系统原语)、有界的空间开销、较低的内部碎片,并通过几乎完全依赖原子操作实现了最小的竞争。mimalloc 已在 GitHub 上开源,并拥有超过 1.2 万颗星。

At the RiSE group at Microsoft Research (MSR), we conduct fundamental research into formal methods, programming languages, and software engineering (including emerging agentic systems), with a particular focus on systems that can be provably correct, secure, and performant. The mimalloc memory allocator was initially designed in 2020 as a fast allocator for the state-of-the-art Lean and Koka programming languages developed at RiSE, both of which use novel compiler-guided reference counting (see Perceus).

在微软研究院 (MSR) 的 RiSE 小组,我们对形式化方法、编程语言和软件工程(包括新兴的智能体系统)进行基础研究,特别关注那些可证明正确、安全且高性能的系统。mimalloc 内存分配器最初设计于 2020 年,旨在为 RiSE 开发的最先进的 Lean 和 Koka 编程语言提供快速分配支持,这两种语言都使用了新颖的编译器引导引用计数技术(参见 Perceus)。

The scalable design of mimalloc has also proved to work exceedingly well for large services at Microsoft. Through close cooperation with product teams, mimalloc has significantly improved the response times in services such as Bing. Today, mimalloc is widely used in large services and applications, both within and outside Microsoft. It serves as the allocator for NoGIL CPython 3.13+, is integrated into Unreal Engine, and is used in games such as Death Stranding. The project is open source on GitHub, with over 12K stars; its Rust wrapper alone sees over 100K downloads per day.

mimalloc 的可扩展设计在微软的大型服务中也表现得非常出色。通过与产品团队的密切合作,mimalloc 显著改善了 Bing 等服务的响应时间。如今,mimalloc 已在微软内外部的大型服务和应用程序中得到广泛应用。它不仅是 NoGIL CPython 3.13+ 的分配器,还被集成到虚幻引擎 (Unreal Engine) 中,并用于《死亡搁浅》(Death Stranding) 等游戏中。该项目在 GitHub 上开源,拥有超过 1.2 万颗星;仅其 Rust 封装库每天的下载量就超过 10 万次。

mimalloc is effective across a wide range of scenarios; from small-scale applications like Koka or Lean, to large services with memory footprints exceeding 500 GiB and hundreds of threads. Despite this range, the codebase is still compact, at around 12K lines of C. Reflecting its research origins, mimalloc emphasizes clear internal data structures with strong invariants, making it easier to understand and reason about than many industry allocators. As Fred Brooks already remarked in his famous book The Mythical Man-Month: “Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t need your flowchart; it’ll be obvious.”

mimalloc 在各种场景下都非常有效;从 Koka 或 Lean 等小规模应用,到内存占用超过 500 GiB 且拥有数百个线程的大型服务。尽管应用范围广泛,但其代码库依然紧凑,仅约 1.2 万行 C 代码。反映其研究背景的是,mimalloc 强调具有强不变性的清晰内部数据结构,使其比许多工业级分配器更容易理解和推导。正如 Fred Brooks 在其名著《人月神话》中所言:“给我看你的流程图而隐藏你的表格,我仍会感到困惑。给我看你的表格,我就不需要流程图了;一切将显而易见。”

As a result, mimalloc has been ported to many platforms—Windows, macOS, Linux, FreeBSD, NetBSD, DragonFly, and various consoles—, and is easy to build and integrate into other projects. For example, the clear data structures enabled Sam Gross and others to adopt mimalloc as the concurrent allocator for NoGIL CPython. The design also makes it relatively straightforward to implement cyclic garbage collection on top of this.

因此,mimalloc 已被移植到许多平台——Windows、macOS、Linux、FreeBSD、NetBSD、DragonFly 以及各种游戏主机——并且易于构建和集成到其他项目中。例如,清晰的数据结构使 Sam Gross 等人能够采用 mimalloc 作为 NoGIL CPython 的并发分配器。这种设计也使得在其之上实现循环垃圾回收变得相对简单。

The Fast Path / 快速路径

As with other scalable allocators (such as tcmalloc and jemalloc), a core design principle of mimalloc is that each thread maintains its own thread-local heap, which we call a “theap”. Each theap owns a set of mimalloc “pages,” which are usually 64 KiB. Each mimalloc page contains blocks of a fixed size, organized into size classes to reduce internal fragmentation. By giving each thread its own theap and set of mimalloc pages, memory allocation and deallocation typically proceed without synchronization. Atomic operations are only required when a thread frees a block allocated by another thread.

与其他可扩展分配器(如 tcmalloc 和 jemalloc)一样,mimalloc 的核心设计原则是每个线程维护自己的线程本地堆,我们称之为“theap”。每个 theap 拥有一组 mimalloc “页面”,通常为 64 KiB。每个 mimalloc 页面包含固定大小的块,并按大小类别组织以减少内部碎片。通过为每个线程提供自己的 theap 和一组 mimalloc 页面,内存分配和释放通常无需同步即可进行。只有当一个线程释放由另一个线程分配的块时,才需要原子操作。

Moreover, in practice, most allocations are quite small, often less than 1 KiB. For such small allocations, mimalloc provides a fast path where the main allocation function looks like:

此外,在实践中,大多数分配都非常小,通常小于 1 KiB。对于此类小额分配,mimalloc 提供了一条快速路径,其主要分配函数如下所示:

void* mi_malloc( size_t size ) {
  mi_theap_t* const theap = mi_get_thread_local_theap();
  if (size > MI_MAX_SMALL_SIZE) return mi_malloc_generic(theap,size); // slow generic path
  const size_t index = (size + sizeof(void*))/sizeof(void*); // round size
  mi_page_t* const page = theap->small_pages[index];
  mi_block_t* const block = page->free; // head of free list
  if (block == NULL) return mi_malloc_generic(theap,size); // slow generic path
  page->free = block->next; // pop free list
  page->used++;
  return block;
}

By using thread-local theaps, we need no atomic operations or thread synchronization. We also try to minimize the number of branches. In particular, the thread-local theap is never NULL, and we initialize it with a special empty theap with all empty pages. This way, we do not need a separate check if the theap is NULL. Similarly, the pointers in the small_pages array are never NULL, and we use again special empty pages (with page->free==NULL) to avoid a separate check. Finally, pages are initialized with a free list rather than a separate bump pointer, avoiding special cases and enabling allocation by simply popping blocks from the free list.

通过使用线程本地的 theap,我们不需要原子操作或线程同步。我们还尽量减少分支数量。特别是,线程本地的 theap 永远不会为 NULL,我们使用一个带有所有空页面的特殊空 theap 对其进行初始化。这样,我们就不需要单独检查 theap 是否为 NULL。同样,small_pages 数组中的指针也永远不会为 NULL,我们再次使用特殊的空页面(page->free==NULL)来避免单独检查。最后,页面使用空闲列表而不是单独的指针碰撞 (bump pointer) 进行初始化,从而避免了特殊情况,并能够通过简单地从空闲列表中弹出块来实现分配。

On x64, this code now translates into few instructions with just two uncommon branches:

在 x64 架构上,这段代码现在被转换为极少数指令,且仅包含两个不常见的分支:

mi_malloc:
  movq %rdi, %rsi ; rsi = size
  movq _mi_theap_default@GOTTPOFF(%rip), %rax
  movq %fs:(%rax), %rdi ; rdi = thread local theap
  cmpq $1024, %rsi ; size > MI_MAX_SMALL_SIZE?
  ja .LBB0_generic
  leaq 7(%rsi), %rax ; round to sizeof(void*)
  andq $-8, %rax
  movq 232(%rdi,%rax), %rcx ; rcx = heap->small_pages[index]
  movq 8(%rcx), %rax ; block = rax = page->free
  testq %rax, %rax ; block == NULL?
  je .LBB0_generic
  movq (%rax), %rdx ; page->free = block->next
  movq %rdx, 8(%rcx)
  incw 16(%rcx) ; page->used++
  retq
.LBB0_generic:
  jmp _mi_malloc_generic@PLT ; tailcall

Similarly, mimalloc provides a fast path for freeing blocks. In practice, most blocks are freed by the same thread that allocated the block. We can optimize that case by checking whether the current thread ID matches the thread ID stored in the corresponding mimalloc page. If so, we can just push our block on the page’s free list without requiring atomic operations or locks:

同样,mimalloc 也为释放块提供了快速路径。在实践中,大多数块都是由分配该块的同一个线程释放的。我们可以通过检查当前线程 ID 是否与存储在相应 mimalloc 页面中的线程 ID 匹配来优化这种情况。如果是这样,我们只需将块推入页面的空闲列表,而无需原子操作或锁: