A brief note about slot access cost in Common Lisp

A brief note about slot access cost in Common Lisp

关于 Common Lisp 中槽(slot)访问开销的简要说明

Common Lisp is renowned for its excellent object system CLOS. Its implementation is often accompanied by the Metaobject Protocol that, while it is not part of the standard, allows programmers to customize the system underpinnings in numerous interesting ways. This level of customization doesn’t come without a cost – some CLOS code paths will be slower compared to open-coding equivalent solution while abstaining from using standard objects.

Common Lisp 以其卓越的对象系统 CLOS 而闻名。其实现通常伴随着元对象协议(Metaobject Protocol, MOP),虽然它不是标准的一部分,但允许程序员以多种有趣的方式定制系统底层。这种程度的定制并非没有代价——与不使用标准对象而采用开放编码(open-coding)的等效方案相比,某些 CLOS 代码路径会更慢。

The purpose of this blog post is to draw an intuition of differences between structure objects and standard objects when it comes to accessing their slots. From now on I’m going to refer to structure objects as structures, and standard objects as instances. We could imagine a structure is represented in memory as a tuple (CLASS SLOTS), while an instance is represented as a tuple (CLASS STAMP SLOTS).

本篇博文旨在直观地说明结构对象(structure objects)和标准对象(standard objects)在访问其槽时的区别。从现在起,我将结构对象称为“结构(structures)”,将标准对象称为“实例(instances)”。我们可以想象一个结构在内存中表示为一个元组 (CLASS SLOTS),而一个实例表示为一个元组 (CLASS STAMP SLOTS)。

Modifying the structure class has undefined behavior, while the instance’s class may change. This is why the instance needs to track whether it is up-to-date or obsolete. In our simple scheme that information is represented by a stamp that represents the class generation. Tracking whether the instance is obsolete is important, because the memory layout of slots may change - they may be deleted, added, or moved to different positions.

修改结构类会导致未定义的行为,而实例的类是可以改变的。这就是为什么实例需要跟踪它是否是最新的或已过时的。在我们简单的方案中,该信息由代表类代际(class generation)的“戳(stamp)”来表示。跟踪实例是否过时非常重要,因为槽的内存布局可能会发生变化——它们可能会被删除、添加或移动到不同的位置。

This is convenient for long-running programs without downtime, for incremental development and for image-based workflows - the program may be modified at any time to account for changing requirements, without recompiling it from scratch. But it doesn’t come without a downside. The implementation may conformingly assume, that structure accessors won’t ever change and they may be inlined. That means in particular, that structure access is a simple memory reference.

这对于无需停机的长期运行程序、增量开发以及基于镜像(image-based)的工作流非常方便——程序可以随时修改以适应不断变化的需求,而无需从头开始重新编译。但这并非没有缺点。实现可以合规地假设结构访问器永远不会改变,因此它们可以被内联。这意味着,结构访问本质上只是一个简单的内存引用。

(declaim (inline structure-reader-a))
(defun structure-reader-a (object)
  (svref (%slots object) 3))

While with the object we can’t assume that, because we need to check whether the object is not obsolete (at the very least), and because readers are more generic functions - another level of flexibility. Inlining generic functions is hard, because new methods may be added at runtime and the effective method can change. Moreover there may be different classes that have same reader names, so we need to include a piece of code that uses the correct class layout for an instance.

而对于对象,我们不能做这种假设,因为我们至少需要检查对象是否已过时,而且读取器(readers)通常是泛型函数——这带来了另一层灵活性。内联泛型函数很困难,因为新的方法可能在运行时被添加,且有效方法(effective method)可能会改变。此外,不同的类可能具有相同的读取器名称,因此我们需要包含一段代码,为实例使用正确的类布局。

This is why calling instance readers involves: calling a function (can’t be inlined), finding the memory layout (dispatch), and verifying whether the instance is up-to-date. That is exemplified by the following pseudocode that ignores other generic function intrinsics. Depending on the implementation of generic functions, the test for obsolete instances may be evaded when instances are not obsolete.

这就是为什么调用实例读取器涉及:调用函数(无法内联)、查找内存布局(分发/dispatch)以及验证实例是否为最新。以下伪代码展示了这一点(忽略了其他泛型函数的内在机制)。根据泛型函数的实现,当实例未过时时,对过时实例的检查可能会被跳过。

(declaim (notinline instance-reader-a))
(define-reader-function instance-reader-a (object)
  (unless (%up-to-date-p object)
    ;; Among other things updates indexes for memory accesses.
    ;; This is a slow path.
    (%recompile-reader-function #'instance-reader-a)
    (return-from instance-reader-a (instance-reader-a object)))
  (typecase object
    (standard-class-a (svref (%slots object) 3))
    (standard-class-b (svref (%slots object) 4))
    (custom-class-c (slot-value object 'a))
    (custom-class-d (slot-value object 'a))
    (otherwise (no-applicable-method #'instance-reader-a object))))

All this is assuming, that we deal with standard readers. Using the metaobject protocol it is possible to store slot values anywhere, most notably not in a vector bundled with the instance, or to add additional preprocessing. I’m not going to touch here much on MOP - it is just to signify, that standard readers for standard classes may directly access the slot vector. At minimum, assuming a single reader and a clever dispatch algorithm:

以上所有假设都是基于我们处理的是标准读取器。使用元对象协议,可以将槽值存储在任何地方,最显著的是不存储在与实例捆绑的向量中,或者添加额外的预处理。我在这里不会过多涉及 MOP——这只是为了说明,标准类的标准读取器可以直接访问槽向量。至少,假设有一个单一的读取器和一个巧妙的分发算法:

(declaim (notinline instance-reader-a))
(define-reader-function instance-reader-a (object)
  (if (eql (stamp object) 42)
      (svref (%slots object) 3)
      (if (%up-to-date-p object)
          (no-applicable-method #'instance-reader-a object)
          (progn
            (%recompile-reader-function #'instance-reader-a)
            (return-from instance-reader-a (instance-reader-a object))))))

In other words comparing structure access with instance readers is comparing apples to oranges, because the former is a memory access, while the latter is a function call. SLOT-VALUE will be even slower, because this function is a trampoline to a more involved SLOT-VALUE-USING-CLASS, and to do that we need to: read the object class, find the slot definition in the class, and invoke a generic function SLOT-VALUE-USING-CLASS.

换句话说,将结构访问与实例读取器进行比较就像是拿苹果和橘子做比较,因为前者是内存访问,而后者是函数调用。SLOT-VALUE 会更慢,因为该函数是一个通往更复杂的 SLOT-VALUE-USING-CLASS 的跳板,要做到这一点,我们需要:读取对象类、在类中查找槽定义,并调用泛型函数 SLOT-VALUE-USING-CLASS

The generic function SLOT-VALUE-USING-CLASS may be similar to the reader defined above with a caveat, that it has more arguments to dispatch on (so the dispatch procedure may be more involved). In any case, it is at least as slow as the optimal reader defined above (a single reader for the standard class).

泛型函数 SLOT-VALUE-USING-CLASS 可能类似于上面定义的读取器,但需要注意的是,它有更多的参数需要进行分发(因此分发过程可能更复杂)。无论如何,它至少和上面定义的最佳读取器(针对标准类的单一读取器)一样慢。

(defun slot-value (object slot-name)
  (let* ((class (class-of object))
         (slots (mop:class-slots class))
         (slot (find slot-name slots :key #'mop:slot-definition-name)))
    (mop:slot-value-using-class class object slot)))

Tim Bradshaw recently made a blog post that claims that instance slot access is around 38x slower than structure access, but he compares inlined memory access to generic function dispatch. A fair comparison would use the operator STANDARD-INSTANCE-ACCESS. Metaobject protocol defines an optimized way to access instance slots that does not incur overhead associated with dispatching generic functions by calling the function STANDARD-INSTANCE-ACCESS. This function may be inlined and is similar to structure object accessors.

Tim Bradshaw 最近写了一篇博文,声称实例槽访问比结构访问慢约 38 倍,但他比较的是内联内存访问与泛型函数分发。一个公平的比较应该使用操作符 STANDARD-INSTANCE-ACCESS。元对象协议定义了一种优化方式来访问实例槽,通过调用 STANDARD-INSTANCE-ACCESS 函数,避免了与泛型函数分发相关的开销。该函数可以被内联,并且类似于结构对象访问器。

A possible definition would look like this:

可能的定义如下所示:

(declare (inline mop:standard-instance-access))
(defun mop:standard-instance-access (object location)
  (svref (%slots object) location))

The argument LOCATION is technically an opaque object, but for the illustration purposes we assume that it is an index (it usually is!). Its value may be read using…

参数 LOCATION 在技术上是一个不透明对象,但为了说明起见,我们假设它是一个索引(通常确实如此!)。它的值可以通过……读取。