Deconstructing Datalog
Deconstructing Datalog
In September 2022, after two rounds of revisions, I submitted the final version of my PhD dissertation, Deconstructing Datalog. Datalog is a logic programming language from the ’80s that augments relational algebra with recursive queries. It has both simple semantics and efficient implementation strategies. Like Lisp and the Velvet Underground, its influence exceeds its popularity; its ideas are still being absorbed into the mainstream.
2022 年 9 月,经过两轮修订,我提交了我的博士论文最终版——《解构 Datalog》(Deconstructing Datalog)。Datalog 是一种源自 80 年代的逻辑编程语言,它通过递归查询增强了关系代数。它既拥有简单的语义,又具备高效的实现策略。就像 Lisp 和地下丝绒乐队(The Velvet Underground)一样,它的影响力远超其流行程度;其思想至今仍在被主流技术所吸收。
While Datalog has a high power-to-weight ratio, it is fairly limited. For one thing, it has no functions or procedures: no way to abstract out repetitive code. As a fan of typed functional programming, I figured: how hard could it be? We have x (bottom-up logic programming); we have y (functional programming); what is x + y? Like a child playing with Legos, I set out to mash two cool things together to make a bigger, cooler thing: Datafun. It worked! However, there were complications. And if you want the whole story, you’ll just have to read it. It’s a reasonably short dissertation: 97 pages plus end-matter. But the primary results are that (i) this wacky idea works, and (ii) we can make it go reasonably fast, asymptotically speaking.
虽然 Datalog 的“功重比”很高,但它相当有限。首先,它没有函数或过程:无法抽象出重复的代码。作为一名类型化函数式编程的粉丝,我想:这能有多难?我们有 x(自底向上的逻辑编程),我们有 y(函数式编程),那么 x + y 是什么?就像玩乐高的孩子一样,我着手将两样酷东西拼在一起,创造出一个更大、更酷的东西:Datafun。它成功了!然而,过程中出现了一些复杂情况。如果你想了解完整的故事,就得去读读论文。这篇论文篇幅适中:97 页正文加附录。但主要结论是:(i) 这个疯狂的想法行得通,(ii) 从渐近分析的角度来看,我们可以让它运行得相当快。
Taking Datalog apart and putting it back together again
拆解与重构 Datalog
My dissertation’s central idea is that we can seamlessly integrate Datalog’s features into a typed functional language by working backward from their semantics. Take Datalog’s most distinctive feature, recursive queries; for instance, reachability in a graph:
reachable(start). reachable(Y) :- reachable(X), edge(X,Y).
This finds the smallest set reachable such that (1) start is reachable, and (2) if X is reachable and has an edge to Y, then Y is also reachable. We can rewrite these conditions as a single inequality on sets:
reachable ⊇ {start} ∪ {y : x ∈ reachable, (x,y) ∈ edge}
我论文的核心思想是,通过从语义出发进行逆向推导,我们可以将 Datalog 的特性无缝集成到类型化函数式语言中。以 Datalog 最显著的特性——递归查询为例,比如图中的可达性问题:
reachable(start). reachable(Y) :- reachable(X), edge(X,Y).
这会找到最小的可达集合,使得 (1) start 是可达的,且 (2) 如果 X 可达且存在一条指向 Y 的边,那么 Y 也是可达的。我们可以将这些条件重写为集合上的一个不等式:
reachable ⊇ {start} ∪ {y : x ∈ reachable, (x,y) ∈ edge}
We can rewrite this to factor out the right-hand side:
reachable ⊇ f(reachable) where f(R) = {start} ∪ {y : x ∈ R, (x,y) ∈ edge}
This asks for a least prefix point: the least R that includes at least f(R) for some function f on relations. In fact, all recursive queries in Datalog fit this pattern. So to capture recursive queries in a functional language, we need (a) an expressive language for functions over relations, and (b) a least prefix point operator, fix. This turns our query into:
fix (λR. {start} ∪ {y | x ∈ R, (x,y) ∈ edge})
In a few short steps, we’ve turned predicates-and-logic into sets-and-functions. Chapter 2 works out this recipe in detail.
我们可以将其重写以提取右侧部分:
reachable ⊇ f(reachable),其中 f(R) = {start} ∪ {y : x ∈ R, (x,y) ∈ edge}
这要求寻找一个最小前缀点(least prefix point):对于关系上的某个函数 f,寻找包含至少 f(R) 的最小 R。事实上,Datalog 中的所有递归查询都符合这种模式。因此,要在函数式语言中捕获递归查询,我们需要 (a) 一种用于关系函数的表达性语言,以及 (b) 一个最小前缀点算子 fix。这使我们的查询变为:
fix (λR. {start} ∪ {y | x ∈ R, (x,y) ∈ edge})
只需几步,我们就将“谓词与逻辑”转化为了“集合与函数”。第二章详细阐述了这个方案。
Most uniquely, to capture Datalog’s stratification condition, which ensures recursive queries are well-defined, Datafun needs to track monotonicity in its type system. This works compositionally: the composition of two maps is monotone if both maps are monotone; otherwise not. More technically, non-monotonicity can be represented in an otherwise monotone world by (in category theory) a monoidal comonad or (in type theory) a necessity modality or (in the compiler) carefully tracking which variables are safe to use non-monotonically. Like most type systems this is safe but incomplete — the type-checker rejects some monotone programs as non-monotone — but it handles everything Datalog can, and perhaps more.
最独特的是,为了捕获 Datalog 的分层条件(确保递归查询定义良好),Datafun 需要在其类型系统中跟踪单调性。这是组合式工作的:如果两个映射都是单调的,则它们的组合也是单调的;否则则不然。从技术上讲,非单调性可以在一个原本单调的世界中通过(范畴论中的)幺半群余单子(monoidal comonad)、(类型论中的)必然性模态(necessity modality),或者(在编译器中)仔细跟踪哪些变量可以安全地非单调使用来表示。像大多数类型系统一样,这是安全但不完备的——类型检查器会拒绝某些单调程序,认为它们是非单调的——但它能处理 Datalog 能处理的一切,甚至更多。
How to find fixed points fast(er)
如何更快地寻找不动点
An important detail omitted above is fix’s implementation: how we find the least set R such that R ⊇ f(R). Take graph reachability, for instance:
f(R) = {start} ∪ {y | x ∈ R, (x,y) ∈ edge}
The naïve approach is to iterate f:
R₀ = ∅ = nothing
R₁ = f(∅) = {start}
R₂ = f(f(∅)) = nodes within 1 edge of start
R₃ = f(f(f(∅))) = nodes within 2 edges of start
… until eventually Rᵢ = Rᵢ₊₁ because we have found the most distant reachable nodes.
上面省略的一个重要细节是 fix 的实现:我们如何找到满足 R ⊇ f(R) 的最小集合 R。以图的可达性为例:
f(R) = {start} ∪ {y | x ∈ R, (x,y) ∈ edge}
朴素的方法是迭代 f:
R₀ = ∅ = 空
R₁ = f(∅) = {start}
R₂ = f(f(∅)) = 距离 start 一条边以内的节点
R₃ = f(f(f(∅))) = 距离 start 两条边以内的节点
……直到最终 Rᵢ = Rᵢ₊₁,因为我们已经找到了最远的可达节点。
This is an extremely inefficient breadth-first search, because it does redundant work. Consider the set-comprehension {y | x ∈ Rᵢ, (x,y) ∈ edges} needed to compute f(Rᵢ). This examines every node in Rᵢ; so in total we do at least as much work as the sum of the sizes of each Rᵢ. Now observe that the sequence Rᵢ grows monotonically, R₀ ⊆ R₁ ⊆ R₂ ⊆ ..., because a node within 1 edge of start is also within 2 edges, et cetera. So if we reach a node in an early iteration, we wastefully re-examine it in every following iteration. On some graphs this produces quadratic blowup (e.g. let edge = {(i, i+1) | i ∈ [1..n]}), each node being examined on average a linear number of times!
这是一种效率极低的广度优先搜索,因为它做了冗余工作。考虑计算 f(Rᵢ) 所需的集合推导 {y | x ∈ Rᵢ, (x,y) ∈ edges}。它会检查 Rᵢ 中的每一个节点;因此总工作量至少等于每个 Rᵢ 大小的总和。现在观察到序列 Rᵢ 是单调增长的,R₀ ⊆ R₁ ⊆ R₂ ⊆ ...,因为距离 start 一条边以内的节点也一定在两条边以内,依此类推。所以,如果我们能在早期迭代中到达一个节点,那么在随后的每一次迭代中重复检查它就是一种浪费。在某些图上,这会导致二次方爆炸(例如令 edge = {(i, i+1) | i ∈ [1..n]}),每个节点平均被检查了线性次数!
The solution is to examine each node only once. Naïve iteration asks “what can I deduce in at most n steps?” for iteratively increasing n. Seminaïve iteration asks: “what can I deduce in n steps that I couldn’t deduce before?” In other words, it asks for the changes between steps of naïve iteration — the frontiers of knowledge. We can compute these frontiers by incrementalizing our deduction function f, figuring out how it responds to change: given a change to its input, how does its output change — i.e. given the previous frontier, what is the next frontier? This in turn we can do by taking the discrete derivative of our program, in a sense made precise by prior work on the incremental λ-calculus. Chapter 3 shows how to apply this work to Datafun, incrementalizing it with respect to monotone (increasing) change.
解决方案是每个节点只检查一次。朴素迭代通过不断增加 n 来询问“我最多能在 n 步内推导出什么?”。半朴素迭代(Seminaïve iteration)则询问:“在 n 步内,我能推导出哪些之前无法推导出的内容?”换句话说,它寻找的是朴素迭代步骤之间的变化——即知识的边界。我们可以通过对推导函数 f 进行增量化来计算这些边界,弄清楚它如何响应变化:给定输入的改变,输出如何变化——即给定上一个边界,下一个边界是什么?这可以通过对我们的程序进行离散微分来实现,这一概念在关于增量 λ-演算(incremental λ-calculus)的先前工作中已得到精确定义。第三章展示了如何将这项工作应用于 Datafun,使其能够针对单调(增加)的变化进行增量化。
Sine qua non
不可或缺
My dissertation lacks acknowledgements because I found them hard to write. Any real choice meant leaving someone out, and the non-choice of throwing everyone in felt disingenuous and exhausting. I took the easy way out and said nothing. For this post I’ve chosen a stark criterion: people without whom I would not have written this thesis. This is a pretty short list: Neel Krishnaswami, my advisor — without whom I might not have written any thesis, let alone this one — for encouraging the fumblings that eventually became Datafun; for our two papers together (so far); and for everything else. Eve/Kodowa (at the time, Chris Granger, Rob Attorri, and Jamie Brandon) for introducing me to Datalog and turning me down from my dream job, ensuring I kept…
我的论文没有致谢,因为我觉得很难写。任何真正的选择都意味着会遗漏某人,而把所有人都写进去这种“不选择”又显得虚伪且令人疲惫。我选择了最简单的做法:什么都不说。但在这篇文章中,我选择了一个严格的标准:那些没有他们我就无法完成这篇论文的人。名单很短:我的导师 Neel Krishnaswami——没有他,我可能根本写不出任何论文,更不用说这一篇了——感谢他鼓励了那些最终演变成 Datafun 的笨拙尝试;感谢我们共同发表的两篇论文(目前为止);以及感谢他所做的一切。Eve/Kodowa(当时的 Chris Granger、Rob Attorri 和 Jamie Brandon),感谢他们将我引入 Datalog 的世界,并拒绝了我梦寐以求的工作机会,从而确保我能够继续……