Type-checked non-empty strings
Type-checked non-empty strings
This post is a Haskell koan. We’ll get to the background and motivation, but the goal here is to share a small and uncommon technique that we’ve employed, and enjoyed; perfect blog fodder. In short, we wrote a type-checked non-empty string constructor, replaced thousands of equivalent TemplateHaskell calls, for a ~10% build-time improvement in a large and data-heavy package that made many calls to it.
这篇文章是一则 Haskell 公案。我们稍后会介绍背景和动机,但本文的目标是分享一种我们使用过且非常喜欢的小众技巧;这绝对是绝佳的博客素材。简而言之,我们编写了一个经过类型检查的非空字符串构造函数,替换了数千个等效的 TemplateHaskell 调用,从而在一个包含大量数据且频繁调用该构造函数的软件包中,实现了约 10% 的构建时间优化。
before, after, invalidBefore, invalidAfter :: NonEmptyText
before = $$(NonEmptyText.make "hello")
after = NonEmptyText.make "hello"
invalidBefore = $$(NonEmptyText.make "") -- ⇝ error during splice evaluation ...
invalidAfter = NonEmptyText.make "" -- ⇝ type error: expected a non-empty string
To make invalid states unrepresentable is a core design goal of the software at Bellroy. With that in mind, a type we often employ for textual data – of which we have a lot – is NonEmptyText. It is just what it sounds like: a value of this type is a string with at least one character.
让非法状态无法被表示,是 Bellroy 软件开发的核心设计目标。考虑到这一点,我们经常为文本数据(我们有大量此类数据)使用一种名为 NonEmptyText 的类型。正如其名:该类型的值是一个至少包含一个字符的字符串。
The technique is a result of a convergence of GHC features over the past 15 years or so. In particular, RequiredTypeArguments, introduced in GHC 9.10, lets us pass a type-level string literal into a function as if it were a value. We can throw a custom type error message like “Expected a non-empty string” if we (at the type-level) see an empty string; like so:
这项技术是过去 15 年左右 GHC 特性融合的产物。特别是 GHC 9.10 引入的 RequiredTypeArguments,它允许我们将类型级别的字符串字面量像值一样传递给函数。如果我们(在类型级别)检测到空字符串,就可以抛出自定义的类型错误消息,例如 “Expected a non-empty string”;如下所示:
type family IsNonEmptySymbol symbol :: Constraint where
IsNonEmptySymbol "" = Unsatisfiable (Text "Expected a non-empty string")
IsNonEmptySymbol _ = (()::Constraint) -- empty constraint is always satisfied
-- Note previous syntax without RequiredTypeArguments:
-- make :: forall symbol. IsNonEmptySymbol symbol => NonEmptyText
-- with usage like `make @"hello!"`
make :: forall symbol -> (IsNonEmptySymbol symbol) => NonEmptyText
make symbol = NonEmptyText (fromString (symbolVal (Proxy :: Proxy symbol)))
test :: NonEmptyText
test = make "hello!"
Which, with the right LANGUAGE incantations, does actually work. This requires UndecidableInstances to work, which is not harmful in and of itself, but does open up the possibilities of what can go wrong. Additionally, since IsNonEmptySymbol is a type family, it can’t directly be used like an ordinary typeclass constraint – for instance, it can’t be packaged into a Dict, or used with functions like Data.SOP.hcfoldMap; it’s not something you can just “ask for an instance of” like an ordinary typeclass, despite returning a Constraint like one.
只要加上正确的 LANGUAGE 编译指示,这段代码确实可以工作。这需要 UndecidableInstances 的支持,它本身并无害处,但确实增加了出错的可能性。此外,由于 IsNonEmptySymbol 是一个类型族(type family),它不能像普通的类型类约束那样直接使用——例如,它不能被打包进 Dict,也不能与 Data.SOP.hcfoldMap 等函数一起使用;尽管它返回的是一个 Constraint,但你不能像对待普通类型类那样“请求它的实例”。
The final step then to this trick is writing IsNonEmptySymbol as a typeclass:
因此,实现这一技巧的最后一步是将 IsNonEmptySymbol 写成一个类型类:
class IsNonEmptySymbol symbol
instance {-# OVERLAPPING #-} Unsatisfiable (Text "Expected a non-empty string") => IsNonEmptySymbol ""
instance IsNonEmptySymbol a
-- make: same as above
make :: forall symbol -> (IsNonEmptySymbol symbol) => NonEmptyText
make symbol = NonEmptyText (fromString (symbolVal (Proxy :: Proxy symbol)))
When GHC resolves the IsNonEmptySymbol constraint given an empty string, it finds both: instance IsNonEmptySymbol "" and instance IsNonEmptySymbol a. If we omitted the OVERLAPPING pragma, that would be where GHC raises an error; it wouldn’t know which instance to actually choose, and would complain that there are overlapping possibilities – which is fine, since the only case where there is an overlapping instance is in the case we want to forbid, when the input is "". So the effect of the OVERLAPPING pragma here is that GHC does choose the instance we “want”; the one with the custom type error. The instance then raises our custom error message informing the user that a non-empty string was expected.
当 GHC 在给定空字符串时解析 IsNonEmptySymbol 约束,它会同时找到 instance IsNonEmptySymbol "" 和 instance IsNonEmptySymbol a。如果我们省略 OVERLAPPING 编译指示,GHC 就会报错;因为它不知道该选择哪个实例,并会抱怨存在重叠的可能性——但这没关系,因为唯一存在重叠的情况正是我们想要禁止的情况,即输入为 "" 时。因此,这里 OVERLAPPING 编译指示的作用是让 GHC 选择我们“想要”的那个实例,即带有自定义类型错误的那个。随后,该实例会抛出我们的自定义错误消息,告知用户需要一个非空字符串。
Impact
Our internal bellroy-data package – containing data such as information about known freight and shipping providers, accounting systems, product data, tax codes, etc. – had thousands of TH splices like $$(NonEmptyText.makeTH _). Moving to the RequiredTypeArgument approach shaved off around 10% of the compilation time for that package.
影响
我们的内部软件包 bellroy-data 包含诸如已知货运和物流供应商信息、会计系统、产品数据、税码等数据,其中有数千个类似 $$(NonEmptyText.makeTH _) 的 TH 拼接。转向 RequiredTypeArgument 方法后,该软件包的编译时间缩短了约 10%。
Similar cases we can employ this trick
We can re-use (almost) the exact same code to validate that a given Natural is positive to make a type-checked Positive constructor. In general we can use this technique for any type we can define a type-level predicate for.
我们可以应用此技巧的类似场景
我们可以(几乎)复用完全相同的代码来验证给定的 Natural 是否为正数,从而创建一个经过类型检查的 Positive 构造函数。通常,对于任何可以定义类型级别谓词(type-level predicate)的类型,我们都可以使用这种技术。
From here you might be able to imagine, for instance, type-level string parsing to define for instance type-safe term syntax for constructing URIs: it will work, but you’ll run into GHC’s default reduction limit of 20 quite quickly; an O(n) type-level validator over a n-length string has a hard-cap of 20 “reduction” (i.e., parsing) steps and thus parsing cannot proceed beyond 20 characters – that is, also assuming your parsing steps themselves aren’t counting towards that limit.
由此你或许可以想象,例如,进行类型级别的字符串解析,以定义用于构建 URI 的类型安全项语法:这行得通,但你会很快遇到 GHC 默认的 20 次归约限制;对长度为 n 的字符串进行 O(n) 的类型级别验证,其“归约”(即解析)步骤有 20 次的硬上限,因此解析无法超过 20 个字符——当然,这是假设你的解析步骤本身没有计入该限制的情况下。
In general non-trivial algorithms are also quite awkward to express in type families; you have no way to write let-bindings for instance, or do pattern matching with a case-like syntax: if you need either, they must be expressed with some combination of extra type arguments and helper type families. This also needs a bit of plumbing to work as a typeclass as in the above IsNonEmptySymbol class we created.
通常,在类型族中表达非平凡算法也非常笨拙;例如,你无法编写 let 绑定,也无法使用类似 case 的语法进行模式匹配:如果需要这些功能,必须通过额外的类型参数和辅助类型族的组合来表达。这还需要一些额外的工作,才能像我们上面创建的 IsNonEmptySymbol 类那样作为类型类工作。
Nevertheless, if you’ve gotten this far, you’re probably interested in what that can look like: behold 🪄, type-level parsing for DynamoDB table-names:
尽管如此,如果你读到了这里,你可能对它看起来是什么样子很感兴趣:看好了 🪄,这是用于 DynamoDB 表名的类型级别解析:
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE QuantifiedConstraints #-}
{-# LANGUAGE RequiredTypeArguments #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE UndecidableInstances #-}
import Data.Proxy
import Data.String (fromString)
import Data.Text (Text)
import Data.Type.Bool qualified as Bool
import GHC.TypeError
import GHC.TypeLits
-- | Valid DynamoDB table names must match regex /^[a-zA-Z_.-]{3,255}$/
newtype TableName = TableName Text deriving (Show)
make :: forall name -> (IsValidTableName name) => TableName
make name = TableName (fromString (symbolVal (Proxy :: Proxy name)))
So far, so good. But enumerating every single invalid string will take some time; we need an algorithmic approach rather than how IsNonEmptySymbol worked above. As above, we’d prefer to encapsulate th…
到目前为止,一切顺利。但枚举每一个非法字符串会花费不少时间;我们需要一种算法方法,而不是像上面 IsNonEmptySymbol 那样的工作方式。和上面一样,我们更倾向于封装……