Evaluation order and nontermination in query languages

Evaluation order and nontermination in query languages

查询语言中的求值顺序与非终止性

Last month I gave a talk at FLOPS about finite functional programming or ‘λFS’, my latest attempt (previously: Datafun) to combine functional programming with relational programming à la Datalog or SQL – and tensor algebra too, because why not? 上个月,我在 FLOPS 会议上做了一场关于有限函数式编程(简称“λFS”)的演讲。这是我继 Datafun 之后的最新尝试,旨在将函数式编程与 Datalog 或 SQL 风格的关系型编程相结合——顺便还加入了张量代数,毕竟,为什么不呢?

In λFS, a relation R is treated as a function: we let R(x) = true iff x is in R, false otherwise. This function is finite in that R(x) = true for only finitely many inputs x: these are its support. The paper gives a type system for ensuring functions have finite support. 在 λFS 中,关系 R 被视为一个函数:当且仅当 x 在 R 中时,R(x) = true,否则为 false。该函数是有限的,即只有有限数量的输入 x 会使 R(x) = true:这些输入构成了它的“支撑集”(support)。论文提供了一套类型系统,用于确保函数具有有限的支撑集。

At runtime, we represent a finite boolean function by tabulating its support in a hash table or a balanced tree (like a table in a database). This generalizes beyond booleans to any codomain with a ‘default’ value. A function’s support are those inputs whose output is non-default; for instance, the booleans with false as default, or the integers with 0 as default. A finite function can then be represented as a key-value table whose keys are the function’s support; any other key implicitly maps to the default value. Non-boolean finite maps arise naturally as the result of aggregations, while tensor algebra is essentially relational algebra over non-boolean (usually real-valued) finite maps. 在运行时,我们通过将支撑集制表存入哈希表或平衡树(类似于数据库中的表)来表示有限布尔函数。这种方法可以推广到布尔值以外的任何具有“默认值”的陪域。函数的支撑集是指那些输出为非默认值的输入;例如,以 false 为默认值的布尔值,或以 0 为默认值的整数。因此,有限函数可以表示为一个键值表,其键为函数的支撑集;任何其他键隐式映射为默认值。非布尔类型的有限映射自然产生于聚合操作的结果,而张量代数本质上就是针对非布尔(通常是实数)有限映射的关系代数。

Nontermination and evaluation order

非终止性与求值顺序

In the talk, I mentioned that recursion was future work for λFS. Naturally, someone – I think it was Atsushi Igarashi – asked what the difficulty was. The primary difficulty is that recursion allows nontermination. Semantics for nontermination usually use domain theory, which has fallen out of fashion in recent decades, and which consequently I’ve never deeply studied and find a bit intimidating. 在演讲中,我提到递归是 λFS 未来的工作方向。自然地,有人(我想是五十岚淳)问到了其中的难点。主要的困难在于递归允许非终止性。非终止性的语义通常使用域理论(domain theory),但这在近几十年来已不再流行,因此我从未深入研究过,并觉得它有些令人望而生畏。

Moreover, as soon as nontermination or any effect is involved, there is a question of evaluation order. Even in the untyped lambda calculus, where evaluation is confluent, call-by-name and call-by-value evaluation terminate on different programs. What’s worse, in relational languages ‘evaluation order’ has a more expansive meaning than in functional languages. 此外,一旦涉及非终止性或任何副作用,就会出现求值顺序的问题。即使在无类型 lambda 演算中,尽管求值具有汇合性,但按名调用(call-by-name)和按值调用(call-by-value)在不同程序上的终止情况也各不相同。更糟糕的是,在关系型语言中,“求值顺序”比在函数式语言中具有更广泛的含义。

Consider this example: Given R,S : nat => bool input tables and test : nat -> bool a black-box predicate let Q(x) := R(x) and test(x) and S(x) the query 请看这个例子: 给定 R, S : nat => bool 输入表,以及 test : nat -> bool 这个黑盒谓词 设查询 Q(x) := R(x) and test(x) and S(x)

Here, => denotes a finite function, represented as a table, while -> denotes an arbitrary function, represented by a procedure or closure. Thinking relationally, our query Q is the intersection of the tables R and S, filtered by the black-box predicate test. The central question is: in what order do we evaluate this query? 这里,=> 表示一个有限函数(表现为表),而 -> 表示一个任意函数(表现为过程或闭包)。从关系的角度来看,我们的查询 Q 是表 R 和 S 的交集,并由黑盒谓词 test 进行过滤。核心问题是:我们以什么顺序来求值这个查询?

We could go left-to-right: for each x in R, if test(x) passes, check if x is in S. In Python: [x for x in R if test(x) if x in S] 我们可以从左到右执行:对于 R 中的每个 x,如果 test(x) 通过,则检查 x 是否在 S 中。用 Python 表示为: [x for x in R if test(x) if x in S]

Or, noting that test may be expensive but looking up an entry in a table is cheap, we could check if x is in S before calling test: [x for x in R if x in S if test(x)] 或者,考虑到 test 可能开销很大,而查找表中的条目开销很小,我们可以在调用 test 之前检查 x 是否在 S 中: [x for x in R if x in S if test(x)]

And if we happen to know S is much smaller than R, we can save a lot of time by iterating over S instead of R: [x for x in S if x in R if test(x)] 如果我们恰好知道 S 比 R 小得多,我们可以通过遍历 S 而不是 R 来节省大量时间: [x for x in S if x in R if test(x)]

The standard relational philosophy regards evaluation order as an implementation detail, decided by the query planner. However, this ordering may affect whether Q terminates, because different orders call test on different arguments. 标准的数据库关系哲学将求值顺序视为一种实现细节,由查询优化器决定。然而,这种顺序可能会影响 Q 是否终止,因为不同的顺序会在不同的参数上调用 test。

A contrived example: R := {“hello”, “world”} S := {“hello”, “alice”} test(x) := x == “hello” or loop-forever() Q(x) := R(x) and test(x) and S(x) 一个刻意的例子: R := {“hello”, “world”} S := {“hello”, “alice”} test(x) := x == “hello” or loop-forever() Q(x) := R(x) and test(x) and S(x)

Left-to-right execution calls test(“hello”), which succeeds, followed by test(“world”), which loops forever. But if we intersect R and S before calling test, we only test “hello” and so terminate. 从左到右执行会调用 test(“hello”)(成功),接着调用 test(“world”)(陷入死循环)。但如果我们先对 R 和 S 求交集再调用 test,我们只会测试 “hello”,从而能够终止。

I have expressed this problem in λFS, but it arises in any SQL or Datalog engine supporting user-defined functions, and exposes a tension between typical DB and PL assumptions: DB: The query engine may choose the execution strategy to optimize performance. PL: We can reason compositionally about program behavior – in particular, about termination or execution time. 我在 λFS 中表达了这个问题,但它在任何支持用户自定义函数的 SQL 或 Datalog 引擎中都会出现,并揭示了典型数据库(DB)假设与编程语言(PL)假设之间的张力: DB:查询引擎可以选择执行策略以优化性能。 PL:我们可以组合地推理程序行为——特别是关于终止性或执行时间。

λFS tries to bridge DB and PL, and gets stuck in the awkward middle: I must either favor one perspective over another, or find a way to thread the needle’s eye. I do not yet know how best to do this. Instead, here are three directions forward – three sketches of possible semantics for λFS with recursion (or Datalog with nonterminating functions). λFS 试图架起 DB 和 PL 之间的桥梁,却陷入了尴尬的中间地带:我必须要么偏向一方,要么找到一种两全其美的方法。我目前还不知道如何做到最好。因此,这里有三个前进的方向——三个关于 λFS 递归(或带有非终止函数的 Datalog)可能语义的草案。

1. Left to right evaluation has a simple cost model

1. 从左到右求值具有简单的成本模型

The simplest strategy to implement also has the most straightforward semantics: evaluate from left to right. This is the most typical PL approach: it avoids query planning and puts the job squarely in the hands of the programmer. This makes the execution time of a query easy to predict and reason about, as my friend Rob Simmons pointed out to me while working on Finite Choice Logic Programming; this is more or less the cost model of Dusa, an implementation of FCLP. 最简单的实现策略也具有最直接的语义:从左到右求值。这是最典型的 PL 方法:它避免了查询规划,并将工作完全交给了程序员。正如我的朋友 Rob Simmons 在研究有限选择逻辑编程(FCLP)时向我指出的那样,这使得查询的执行时间易于预测和推理;这或多或少是 Dusa(FCLP 的一种实现)的成本模型。