Replacing a 3 GB SQLite db with a 10 MB FST (finite state transducer) binary


Add me on X / Twitter! You can cite this post as a reason if you’re shy. Note for numberphiles: all numbers have been rounded to their first significant digit, because I’m a fan of Rob Eastaway’s “zequals” method of getting to the point when it comes to estimation. It’s much more valuable to walk away with the heuristic “some dude got a 300x memory reduction by swapping out a database he hacked together for a tiny, static, specialized data structure that does exactly what he needs it to and no more.”

I found myself with an increasingly rare opportunity to work this weekend on Taskusanakirja, also often called tsk, a Finnish-English dictionary with incremental search-as-you-type. Fundamentally, this problem reduces to prefix search, and the canonical solution for prefix search with autocomplete is a trie.
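As a sketch of that canonical solution, here is a minimal trie with prefix search in Rust. This is a toy illustration, not tsk's actual implementation:

```rust
use std::collections::HashMap;

// A minimal trie: each node maps the next character to a child node and
// optionally marks the end of a stored word.
#[derive(Default)]
struct TrieNode {
    children: HashMap<char, TrieNode>,
    is_word: bool,
}

#[derive(Default)]
struct Trie {
    root: TrieNode,
}

impl Trie {
    fn insert(&mut self, word: &str) {
        let mut node = &mut self.root;
        for ch in word.chars() {
            node = node.children.entry(ch).or_default();
        }
        node.is_word = true;
    }

    // Collect every stored word that starts with `prefix`.
    fn prefix_search(&self, prefix: &str) -> Vec<String> {
        let mut node = &self.root;
        for ch in prefix.chars() {
            match node.children.get(&ch) {
                Some(child) => node = child,
                None => return Vec::new(),
            }
        }
        let mut out = Vec::new();
        collect(node, prefix.to_string(), &mut out);
        out
    }
}

// Depth-first walk from `node`, accumulating characters into `acc`.
fn collect(node: &TrieNode, acc: String, out: &mut Vec<String>) {
    if node.is_word {
        out.push(acc.clone());
    }
    for (ch, child) in &node.children {
        let mut next = acc.clone();
        next.push(*ch);
        collect(child, next, out);
    }
}

fn main() {
    let mut trie = Trie::default();
    for w in ["opiskelija", "opiskella", "opettaja", "sydän"] {
        trie.insert(w);
    }
    let mut hits = trie.prefix_search("opis");
    hits.sort();
    println!("{:?}", hits); // ["opiskelija", "opiskella"]
}
```

The weakness that matters later in this story: every node is a heap allocation holding its own hash map, so memory usage grows roughly linearly with the number of distinct words.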

And this worked wonderfully for the first implementation of tsk, which was in Go, with a few basic optimizations. To prevent matching on some single-digit percentage of the mid-six-figures number of words we were baking into the binary (it has been a design goal from the start to ship the entire program as one .exe, one .app, or one statically linked binary), we capped results at e.g. the first 50 or 100 matches or so, and then just memoized all of the 1-, 2-, and 3-character combinations. After a year of heavy personal use since, I have never noticed a perceptible delay in the program. With basic optimizations like that, we could just about squeeze the trie into ~60 MB of space.
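The cap-and-memoize trick can be sketched like this, assuming a sorted word list; the names and the exact cap are hypothetical, and the real version would sit on top of the trie rather than a linear scan:

```rust
use std::collections::HashMap;

// Hypothetical cap; the post says "the first 50 or 100 matches or so".
const MAX_RESULTS: usize = 50;

// Precompute the first MAX_RESULTS matches for every 1-, 2-, and 3-character
// prefix occurring in the (sorted) word list, so the hottest queries never
// walk the full data structure.
fn memoize_short_prefixes(words: &[String]) -> HashMap<String, Vec<String>> {
    let mut memo: HashMap<String, Vec<String>> = HashMap::new();
    for word in words {
        let chars: Vec<char> = word.chars().collect();
        for len in 1..=3.min(chars.len()) {
            let prefix: String = chars[..len].iter().collect();
            let hits = memo.entry(prefix).or_default();
            if hits.len() < MAX_RESULTS {
                hits.push(word.clone());
            }
        }
    }
    memo
}

// Short prefixes hit the memo table; longer ones fall back to a capped scan
// (the real program would walk the trie here instead).
fn lookup(memo: &HashMap<String, Vec<String>>, words: &[String], prefix: &str) -> Vec<String> {
    if prefix.chars().count() <= 3 {
        return memo.get(prefix).cloned().unwrap_or_default();
    }
    words
        .iter()
        .filter(|w| w.starts_with(prefix))
        .take(MAX_RESULTS)
        .cloned()
        .collect()
}

fn main() {
    let words: Vec<String> = ["opettaja", "opiskelija", "opiskella", "sydän"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    let memo = memoize_short_prefixes(&words);
    println!("{:?}", lookup(&memo, &words, "op"));
    println!("{:?}", lookup(&memo, &words, "opisk"));
}
```

Because the short prefixes are exactly the queries that match the most words, precomputing them removes the worst-case latency from the interactive path.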

But Finnish is a heavily agglutinative language. It’s not impossible for a single base word in the language to have over one hundred possible endings, when all combinations are considered. And the combinations are not regular! The extremely regularized orthography of the Finnish language also means no fibbing on the page about what speakers actually say, and that means that base words stretch and shift and transmute in acoustically pleasing ways as you layer on endings. This makes perfect sense after you’ve spent a couple of years immersed in the language.

When you’re a beginner and you see a sentence like “Opiskelijassammekin on leijonan sydän”, there is one word you are disproportionately likely to get stuck on. Part of what this tool attempts to do is help the student figure out how to cleave that word at the right edges, by embedding all of that information as well.

The trie fell down at that point. I could keep ~400,000 items in the trie sipping ~50 MB of RAM. The same trick does not scale to 40-60 million items. Not if you want it all to run on the old laptop of a college kid from Jakarta. Frustrated and running out of time, I threw up my hands and said, “We’ll ship the inflections in a separate SQLite database with FTS (Full Text Search) and let them search on that if they’re so desperate.” It worked, still without perceptible delay, but it required a one-time 3-gigabyte download. Not ideal!

That was where the story stopped, about 9 months ago. This weekend, with 9 more months of intense full-time software engineering under my belt, I boldly asked: had I considered rewriting it in Rust?

It turns out there is a very, very smart guy named BurntSushi, aka Andrew Gallant, infamous for ripgrep, a really, really fast grep, a tool so ubiquitously useful that I put it years ago in my “Holy Trinity” of modern shell commands. He also faced a similar problem at some point in the past, and wrote a post called Index 1,600,000,000 Keys with Automata and Rust. (Warning: long, extremely interesting.) The opening spoils it: “It turns out that finite state machines are useful for things other than expressing computation. Finite state machines can also be used to compactly represent ordered sets or maps of strings that can be [prefix, fuzzy, suffix] searched very quickly.”

Well, I thought, this seems promising. Let’s write a minimal Rust program to strip the data out of that 3 GB database and compact it down into one of these FST thingies. I mean, it was always obvious the SQLite database was a hack, but it was the best hack I could manage with the time and energy I had at the time. How small could we get it?

Ten _mega_bytes. A 300x reduction in space. Even in the world of fst crate users, this particular application, mapping conjugations and declensions of a highly agglutinative language back to their source definitions, is extremely well suited to the data structure. Unlike tries, FSTs compress shared suffixes as well as shared prefixes, and in a language like Finnish there is a very small handful of popular suffixes repeated extremely often across the dictionary corpus. The data is also static at runtime, which sidesteps fst’s greatest weakness: an FST is immutable once built.

I do wish to point out, of course, that the whole reason it was possible to experiment cheaply and stumble into this serendipity was that 9 months ago, faced with the choice between doing the bad easy thing and doing nothing, I chose to do the bad easy thing. The SQLite database worked! I understood how it worked behind the scenes, with its B-trees and its Full Text Search extension. I think I even used that same FTS extension to power certain lesser-used features that are not in the alphas of tsk v2.0.0 for the time being, and are likely to be dropped entirely if keeping them means compromising this now-salivatory memory footprint.

Because the Pro version of v2 is shaping up to be about 20 megabytes, all batteries included, which is 3 times smaller than the free version of v1 ever was. We’ll see what makes it past the cutting room in time. tsk started life as a TUI Go program, and in fact evolved out of an earlier fzf prototype called finstem, the highest-ROI program I’ve written so far. The “pocket dictionary” framing (taskusanakirja literally means “pocket dictionary” in Finnish) was always load-bearing: if it doesn’t fit on the kind of dusty laptop someone might inherit from an uncle, it isn’t a pocket dictionary.