A columnar database for analytics in pure Clojure

Flatiron is a columnar analytics library for Clojure. It runs fast analytical queries on in-memory tables through a SQL-like DSL, and it runs graph algorithms on the same data. It’s pure Clojure with no dependencies beyond core.async. Reach for it when a full embedded database would be overkill: load some data, run group-by aggregations, sort and filter, maybe run PageRank on a graph, all in-process with zero configuration.

Why columnar

Most Clojure programs represent tabular data as sequences of maps. That’s fine for a few thousand rows, but it falls apart on larger datasets: every row is a heap-allocated map, every value is boxed, and every access goes through layers of indirection.

Flatiron stores data as typed primitive arrays, one array per column. An integer column is a long[], a float column is a double[], and so on. Operations loop over these arrays directly using unchecked arithmetic, which the JVM can optimize into tight native code. Nulls are handled with sentinel values rather than boxed types, so there’s no pointer chasing.

The morsel engine processes data in 1024-row batches. This amortizes the cost of type dispatch: decide the operation once per batch, then run a tight loop over primitives. The result is performance closer to native C than to idiomatic Clojure.
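To make the contrast concrete, here is a minimal sketch in plain Clojure. It does not use Flatiron; the names null-i64, sum-rows, and sum-column are made up for illustration, and Long/MIN_VALUE as the null sentinel is an assumption rather than Flatiron’s documented choice.

```clojure
;; Plain-Clojure sketch; none of these names are Flatiron API.
(def ^:const null-i64 Long/MIN_VALUE) ; assumed sentinel for a missing long

(defn sum-rows
  "Row-oriented: each row is a map and each value is a boxed Long."
  [rows k]
  (reduce + 0 (keep k rows)))

(defn sum-column
  "Column-oriented: one pass over a long[], skipping the sentinel."
  ^long [^longs xs]
  (let [n (alength xs)]
    (loop [i 0 acc 0]
      (if (< i n)
        (let [v (aget xs i)]
          (recur (unchecked-inc i)
                 (if (== v null-i64) acc (unchecked-add acc v))))
        acc))))

(def rows [{:gold 9000} {:gold nil} {:gold 1200}])
(def gold (long-array [9000 null-i64 1200]))

(sum-rows rows :gold) ;; => 10200
(sum-column gold)     ;; => 10200
```

Both calls return the same total, but the second never allocates a map or a boxed number while it runs.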

Installation

Add the git dependency to your deps.edn:

{io.github.yogthos/flatiron {:git/tag "v0.2.0" :git/sha "98d700ee79b5425cd837db5b7866a69cf4a0f432"}}

It depends on Clojure 1.12.0 and core.async 1.6.681, and requires JDK 18+ (the hash kernels use Math/unsignedMultiplyHigh). CI runs on JDK 21 and 25.

Concepts: Columns and tables

There are five column types. Each stores data as a flat Java array (primitive where the type allows it) with optional null sentinels:

  • I64 — signed 64-bit integers (long[])

  • F64 — 64-bit floats (double[])

  • Bool — booleans (byte[])

  • Sym — Clojure keywords (Object[])

  • Str — strings (Object[])

A Table is a schema (vector of keyword column names) plus a vector of columns. That’s it: no metadata, no indexes, just typed arrays with names.

(require '[flatiron.column :as col])
(require '[flatiron.table :as tbl])

(let [dragons (col/sym-column [:smaug :fafnir :tiamat :smaug :fafnir])
      gold (col/i64-column [9000 750 1200 3100 2400])
      table (tbl/table [:Dragon :Gold] [dragons gold])]
  (tbl/nrows table) ;; => 5
  (tbl/ncols table) ;; => 2
  (tbl/col table :Gold)) ;; => #<I64Column ...> — Smaug is hoarding again

Custom types

Domain types like dates and timestamps don’t need their own column type, nor an Object[] that boxes every value. A LocalDate is just its epoch day and an Instant is just its epoch milli. Those encodings preserve order, so comparison, sorting, group-by, and min/max are already correct on the underlying primitive. Flatiron stores such a type as a normal long[] column tagged with a logical type. The column reports its physical type (:i64) to every operation, so the hot loops are unchanged and the values are never boxed. Only the boundaries (building a column, reading a value back, persisting) run the codec that converts to and from the domain object.

(require '[flatiron.column :as col])
(let [hired (col/date-column [(java.time.LocalDate/of 2019 4 1) (java.time.LocalDate/of 2021 9 15)])]
  (col/-type-tag hired) ;; => :i64 (physical — what operations dispatch on)
  (col/-logical-tag hired) ;; => :date (logical — how values are exposed)
  (col/-get-obj hired 0)) ;; => #object[java.time.LocalDate "2019-04-01"]

Built-in logical types, all backed by long[]: :date (LocalDate), :instant (Instant), :datetime (LocalDateTime), :date-millis (java.util.Date), and :duration (Duration). Build a column for any of them with col/typed-column or the date-column/instant-column/datetime-column/duration-column helpers.
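The order-preserving encoding is easy to see without the library. The sketch below uses only java.time and plain arrays; encode-dates and decode-date are hypothetical helper names, not Flatiron’s codec API.

```clojure
(import '(java.time LocalDate))

;; Hypothetical helpers for illustration, not Flatiron's codec API.
(defn encode-dates
  "Encode LocalDates as epoch days in a long[]."
  [dates]
  (long-array (map #(.toEpochDay ^LocalDate %) dates)))

(defn decode-date
  "Decode one epoch day back into a LocalDate."
  [^long epoch-day]
  (LocalDate/ofEpochDay epoch-day))

(def hired
  (encode-dates [(LocalDate/of 2021 9 15)
                 (LocalDate/of 2019 4 1)]))

;; Sorting runs on the raw longs; decoding happens only at the boundary.
(def sorted
  (let [^longs copy (aclone ^longs hired)]
    (java.util.Arrays/sort copy)
    copy))

(str (decode-date (aget ^longs sorted 0))) ;; => "2019-04-01"
```

Because epoch days increase with calendar order, sorting the longs sorts the dates, and no LocalDate object is created until a value is read back.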

Filtering

The where macro builds a boolean mask by comparing a column against a constant, combines masks for compound predicates with and, or, and not, then materializes the rows that pass into a new table.
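The mask-then-materialize idea can be sketched in plain Clojure over a single long[] column. This is not the where macro or its syntax; mask-gt, mask-and, mask-not, and take-rows are hypothetical names, and the byte[] mask representation is assumed from the Bool column type described earlier.

```clojure
;; Plain-Clojure sketch of mask-based filtering; helper names are hypothetical.
(defn mask-gt
  "Build a byte[] mask: 1 where xs[i] > c, else 0."
  [^longs xs ^long c]
  (let [n (alength xs)
        ^bytes m (byte-array n)]
    (dotimes [i n]
      (when (> (aget xs i) c)
        (aset m i (byte 1))))
    m))

(defn mask-and
  "Element-wise AND of two masks of equal length."
  [^bytes a ^bytes b]
  (let [n (alength a)
        ^bytes m (byte-array n)]
    (dotimes [i n]
      (aset m i (byte (bit-and (long (aget a i)) (long (aget b i))))))
    m))

(defn mask-not
  "Element-wise NOT of a 0/1 mask."
  [^bytes a]
  (let [n (alength a)
        ^bytes m (byte-array n)]
    (dotimes [i n]
      (aset m i (byte (- 1 (long (aget a i))))))
    m))

(defn take-rows
  "Copy the values of xs whose mask byte is 1 into a new long[]."
  [^longs xs ^bytes mask]
  (let [n (alength xs)
        hits (areduce mask i acc 0 (+ acc (long (aget mask i))))
        ^longs out (long-array hits)]
    (loop [i 0 j 0]
      (if (< i n)
        (if (== 1 (long (aget mask i)))
          (do (aset out j (aget xs i))
              (recur (unchecked-inc i) (unchecked-inc j)))
          (recur (unchecked-inc i) j))
        out))))

(def gold (long-array [9000 750 1200 3100 2400]))

;; gold > 1000 AND NOT gold > 5000
(def mask (mask-and (mask-gt gold 1000) (mask-not (mask-gt gold 5000))))

(vec (take-rows gold mask)) ;; => [1200 3100 2400]
```

In a real table the same mask would be applied to every column, so all columns keep the same surviving rows in the same order.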

Morsel engine

The morsel engine is named after the Rayforce concept of a “morsel” (a bite-sized piece of data). Element-wise operations (arithmetic, comparisons) create a morsel source from a column and pull 1024-row batches through it; within each batch, the loop body runs over raw primitive arrays with no protocol dispatch. Aggregations and group-by go one step further: they read the column’s backing array directly in type-specialized loops, falling back to the morsel layer only where the indirection is needed.
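As a rough illustration of batch-at-a-time evaluation, the sketch below picks a kernel once per 1024-row batch and then runs a primitive loop over that slice. It is written in plain Clojure and is not Flatiron’s morsel API; batch-size, the kernels map, and map-column are assumptions made for the example.

```clojure
;; A sketch of batch-at-a-time evaluation; not Flatiron's morsel API.
(def ^:const batch-size 1024)

(defn negate-batch!
  "Tight loop: out[i] = -xs[i] for i in [start, end)."
  [^longs xs ^longs out ^long start ^long end]
  (loop [i start]
    (when (< i end)
      (aset out i (unchecked-negate (aget xs i)))
      (recur (unchecked-inc i)))))

(defn double-batch!
  "Tight loop: out[i] = 2 * xs[i] for i in [start, end)."
  [^longs xs ^longs out ^long start ^long end]
  (loop [i start]
    (when (< i end)
      (aset out i (unchecked-multiply 2 (aget xs i)))
      (recur (unchecked-inc i)))))

(def kernels {:negate negate-batch! :double double-batch!})

(defn map-column
  "Apply op to every element of xs, one batch at a time."
  [op ^longs xs]
  (let [n (long (alength xs))
        out (long-array n)]
    (loop [start 0]
      (when (< start n)
        (let [end (min n (+ start batch-size))
              kernel (kernels op)] ; dispatch happens once per batch
          (kernel xs out start end)
          (recur end))))
    out))

(def xs (long-array (range 3000)))
(def ys (map-column :double xs)) ; batches of 1024, 1024, and 952 rows

(aget ^longs ys 2999) ;; => 5998
```

The map lookup is paid three times for 3000 rows instead of 3000 times, which is the amortization the batch size buys.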