The perils of UUID primary keys in SQLite

The perils of UUID primary keys in SQLite

SQLite 中使用 UUID 主键的风险

It’s common to use random UUIDs as a primary key in databases. One of the known downsides of random UUIDs is that their unordered nature (UUID4) can cause a lot of extra paging for the clustered index because you are inserting rows randomly into the Btree and having to re-balance it. This post tries to help us develop a more visceral understanding of the performance cost of all that extra paging. 在数据库中使用随机 UUID 作为主键非常普遍。随机 UUID 的一个已知缺点是其无序性(UUID4),这会导致聚簇索引(clustered index)产生大量额外的分页(paging),因为你是在向 B 树中随机插入行,并不得不对其进行重新平衡。本文旨在帮助我们更直观地理解这些额外分页所带来的性能代价。

While this post is about SQLite specifically, the problem of random UUIDs also extends to other databases that use clustered indexes. 虽然本文专门讨论 SQLite,但随机 UUID 的问题同样适用于其他使用聚簇索引的数据库。

What is a clustered index?

什么是聚簇索引?

A clustered index determines the physical storage order of the rows in a table. Because of this: 聚簇索引决定了表中行的物理存储顺序。因此:

  • There can be only one clustered index per table (rows can only be physically sorted one way).
  • 每个表只能有一个聚簇索引(行在物理上只能按一种方式排序)。
  • A clustered index is the table.
  • 聚簇索引即表本身。

A non-clustered index, by contrast, stores only the indexed columns plus a pointer to the actual row data, which lives elsewhere. 相比之下,非聚簇索引仅存储索引列以及指向实际行数据的指针,而实际数据存储在其他地方。

Rowid

Rowid

Every ordinary SQLite table has an implicit 64-bit integer primary key called rowid. The table’s data is stored in a B-tree structure containing one entry for each table row, using the rowid value as the key. This is effectively SQLite’s clustered index. The physical storage order of rows follows rowid sequence. 每个普通的 SQLite 表都有一个隐式的 64 位整数主键,称为 rowid。表的数据存储在 B 树结构中,其中包含每个表行的条目,并使用 rowid 值作为键。这实际上就是 SQLite 的聚簇索引。行的物理存储顺序遵循 rowid 的序列。

Without rowid

无 Rowid 表

SQLite also supports WITHOUT ROWID tables. These tables have no implicit rowid. Instead, the primary key you declare becomes the clustered index. SQLite 还支持 WITHOUT ROWID 表。这些表没有隐式的 rowid。相反,你声明的主键将成为聚簇索引。

Note: In SQLite rowid tables are implemented as B-Trees where all content is stored in the leaves of the tree, whereas WITHOUT ROWID tables are implemented using ordinary B-Trees with content stored on both leaves and intermediate nodes. 注意:在 SQLite 中,rowid 表实现为 B+ 树(所有内容存储在叶子节点),而 WITHOUT ROWID 表实现为普通 B 树(内容存储在叶子节点和中间节点中)。


Baseline

基准测试

Let’s establish a performance baseline with regular rowid int primary key. We’ll insert 10 million rows in batches of 1 million. 让我们使用常规的 rowid 整数主键建立性能基准。我们将分 10 批,每批 100 万行,共插入 1000 万行数据。

(Code omitted for brevity)

Results: Roughly a million inserts per second. 结果: 大约每秒插入 100 万行。


UUID4 WITHOUT ROWID

UUID4(无 Rowid)

Now lets try UUID4. 现在让我们尝试 UUID4。

(Code omitted for brevity)

Results: Oh no! What’s happened here the inserts are 14-16x slower?! 结果: 糟糕!发生了什么?插入速度慢了 14-16 倍?!


Profile

性能分析

That’s a big difference. But, lets not guess when we can profile. Below is a normalised diffgraph. A diffgraph compares two profiling snapshots (in this case INTEGER vs UUID4) and displays the differences in a flamegraph structure. 差异巨大。但既然可以进行性能分析,就不要瞎猜。下图是一个归一化的差异图(diffgraph)。差异图比较了两个性能分析快照(本例中为 INTEGER 与 UUID4),并以火焰图结构显示差异。

We can see from the diffgraph that we are spending a lot more time balancing the tree, reading and writing. This is because the unordered nature of UUID4 means they are ordered randomly which is forcing SQLite to constantly re-balance the Btree. 从差异图中可以看出,我们花费了更多时间在树的平衡、读取和写入上。这是因为 UUID4 的无序性意味着它们是随机排序的,这迫使 SQLite 不断地重新平衡 B 树。


UUID7 WITHOUT ROWID

UUID7(无 Rowid)

We can theoretically fix this with UUID7 which is time ordered eliminating the ordering problem of UUID4. Let’s see if this improves things. 理论上,我们可以使用按时间排序的 UUID7 来解决这个问题,从而消除 UUID4 的排序问题。让我们看看这是否能改善情况。

(Code omitted for brevity)

Results: Back to a more reasonable number. Still slower than our baseline. UUID blob primary keys are 16 bytes vs int primary keys which are 8 bytes. 结果: 回到了一个更合理的数值。但仍然比我们的基准测试慢。UUID 二进制主键为 16 字节,而整数主键为 8 字节。


UUID4 WITH ROWID

UUID4(带 Rowid)

Now lets try UUID4 WITH ROWID. This means the hidden rowid will be the clustered index. The upside of this is rowid is sequential. The downside is that the table now has two indexes so there will be write amplification. 现在让我们尝试带 rowid 的 UUID4。这意味着隐藏的 rowid 将成为聚簇索引。这样做的好处是 rowid 是顺序的。缺点是表现在有两个索引,因此会产生写放大。

(Code omitted for brevity)

Results: This doesn’t perform as well as UUID7 WITHOUT ROWID partly because you still building an index with random insertions (even though it’s not the clustered index). 结果: 这种方式的性能不如 UUID7 WITHOUT ROWID,部分原因是即使它不是聚簇索引,你仍然在构建一个包含随机插入的索引。


Conclusion

结论

Hopefully, this post helps illustrate some of the pitfalls with UUID primary keys in SQLite and how to navigate them. 希望这篇文章有助于说明在 SQLite 中使用 UUID 主键的一些陷阱以及如何应对它们。