Automatic Construction of a Legal Citation Graph from 100 Million Ukrainian Court Decisions: Large-Scale Extraction, Topological Analysis, and Ontology-Driven Clustering

Abstract: Half a billion citation edges extracted from 100.7 million Ukrainian court decisions reveal that judicial citation structure encodes legal domain boundaries without supervision and predicts future legislative importance with near-perfect accuracy.

We construct the first large-scale citation graph from the complete EDRSR registry (99.5 million full texts, 1.1 TB), extracting 502 million citation links across six types via regex on commodity hardware in approximately 5 hours, with precision of 1.00 on a 200-decision validation sample (95% Wilson CI: [0.982, 1.000]).
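
The regex-based extraction step can be illustrated with a minimal sketch. The code abbreviations, the single pattern, and the sample sentence below are hypothetical; the abstract does not list the six citation types or the patterns the pipeline actually uses.

```python
import re
from collections import Counter

# Hypothetical abbreviations of Ukrainian codes (assumption, not from the paper).
CODES = ("ЦК", "КК", "ГК", "ЦПК", "КПК", "ГПК", "КАС")

# Matches "ст. 625 ЦК України" or "статті 22 ЦК України" (illustrative pattern only).
ARTICLE_RE = re.compile(
    r"ст(?:\.|атт[яію])\s*(\d+)\s+(" + "|".join(CODES) + r")\s+України"
)

def extract_citations(text):
    """Return (code, article_number) pairs in order of appearance."""
    return [(m.group(2), m.group(1)) for m in ARTICLE_RE.finditer(text)]

# Made-up sample sentence for demonstration.
SAMPLE = (
    "Відповідно до ст. 625 ЦК України та статті 22 ЦК України суд стягнув "
    "збитки; за ст. 185 КК України обвинуваченого засуджено."
)

citations = extract_citations(SAMPLE)
per_code = Counter(code for code, _ in citations)
print(citations)
print(per_code)
```

A single compiled pattern applied with `finditer` keeps the pass over each text linear, which is in the spirit of running on commodity hardware. A production pipeline would presumably need many more patterns to cover inflected forms, article ranges, and the other citation types.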

Three principal findings emerge. (1) The degree distribution follows a power law (alpha = 1.57 +/- 0.008), an exponent close to that of the EU Court of Justice citation network and below that of the US Supreme Court; hub articles are cited by millions of decisions.
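
As an illustration of how a power-law exponent and its standard error can be estimated, the sketch below uses the standard continuous maximum-likelihood estimator, alpha = 1 + n / sum(ln(x_i / x_min)), with standard error (alpha - 1) / sqrt(n). This is an assumption made for illustration: the abstract does not state which estimator or which x_min the authors used, and the data here are synthetic.

```python
import math
import random

def fit_power_law(degrees, x_min=1.0):
    """Continuous MLE for the exponent of p(x) ~ x^(-alpha), x >= x_min."""
    tail = [d for d in degrees if d >= x_min]
    n = len(tail)
    log_sum = sum(math.log(d / x_min) for d in tail)
    if n == 0 or log_sum == 0.0:
        raise ValueError("need at least one observation strictly above x_min")
    alpha = 1.0 + n / log_sum
    std_err = (alpha - 1.0) / math.sqrt(n)
    return alpha, std_err

def sample_power_law(alpha, x_min, n, seed=0):
    """Draw synthetic values by inverse-transform sampling (demo data only)."""
    rng = random.Random(seed)
    exponent = -1.0 / (alpha - 1.0)
    return [x_min * (1.0 - rng.random()) ** exponent for _ in range(n)]

# Synthetic check: recover a known exponent from generated data.
synthetic = sample_power_law(alpha=1.57, x_min=1.0, n=50_000)
alpha_hat, std_err = fit_power_law(synthetic, x_min=1.0)
print(f"alpha = {alpha_hat:.3f} +/- {std_err:.3f}")
```

Real degree data are integer-valued, so a discrete estimator with a fitted x_min would be more appropriate in practice; the continuous form is shown only because it fits in a few lines.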

(2) Louvain community detection on the co-citation projection recovers legal domain boundaries (civil, criminal, administrative, commercial) with modularity Q = 0.44-0.55 and temporal stability (NMI = 0.83-0.86 across periods), constituting an automatically constructed legal ontology grounded in judicial practice.
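
To make the co-citation projection and the modularity score concrete, here is a minimal pure-Python sketch. It does not implement Louvain itself; it only builds the weighted co-citation graph and evaluates Q = sum_c [ L_c / m - (d_c / 2m)^2 ] for a given partition. The article identifiers and the toy partition are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cocitation_graph(decisions):
    """Weight of edge (a, b) = number of decisions citing both a and b."""
    weights = defaultdict(int)
    for cited in decisions:
        for a, b in combinations(sorted(set(cited)), 2):
            weights[(a, b)] += 1
    return dict(weights)

def modularity(weights, community):
    """Newman modularity of a partition on a weighted undirected graph."""
    total = sum(weights.values())      # m: total edge weight
    strength = defaultdict(float)      # weighted degree per node
    internal = defaultdict(float)      # L_c: edge weight inside community c
    for (a, b), w in weights.items():
        strength[a] += w
        strength[b] += w
        if community[a] == community[b]:
            internal[community[a]] += w
    comm_strength = defaultdict(float)  # d_c: summed strength per community
    for node, s in strength.items():
        comm_strength[community[node]] += s
    return sum(
        internal[c] / total - (comm_strength[c] / (2.0 * total)) ** 2
        for c in comm_strength
    )

# Toy data: each set holds the articles cited by one decision (made-up ids).
decisions = [
    {"c1", "c2", "c3"}, {"c1", "c2"},
    {"k1", "k2", "k3"}, {"k1", "k2"},
    {"c3", "k3"},
]
graph = cocitation_graph(decisions)
by_domain = {n: n[0] for pair in graph for n in pair}  # "c" vs "k"
q_domains = modularity(graph, by_domain)
q_single = modularity(graph, {n: "all" for n in by_domain})
print(round(q_domains, 4), round(q_single, 4))
```

On real data a library implementation such as `networkx.community.louvain_communities` could search for the partition, with a function like the one above used only to score it.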

(3) Citation features predict membership in the top-1000 articles with AUC = 0.9984, substantially outperforming a naive frequency baseline (P@1000 = 0.655). Temporal dynamics reveal legislative regime changes as phase transitions and the 2022 invasion as a citation entropy spike (H: 11.02 -> 13.49), accompanied by emergent wartime legislation nodes.
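
The citation entropy mentioned above can be sketched as the Shannon entropy of the distribution of citation targets within a time window. The use of base-2 logarithms and the definition over cited targets are assumptions; the abstract does not specify either. The toy target lists are made up.

```python
import math
from collections import Counter

def citation_entropy(cited_targets):
    """Shannon entropy (bits) of the empirical distribution of cited targets."""
    counts = Counter(cited_targets)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy windows: citations concentrated on one article vs. spread over four.
concentrated = ["art_625"] * 8
spread = ["art_625", "art_22", "art_185", "art_1166"] * 2

h_concentrated = citation_entropy(concentrated)
h_spread = citation_entropy(spread)
print(h_concentrated, h_spread)
```

If the reported H values are in bits, the rise from 11.02 to 13.49 would correspond to the effective number of equally cited targets, 2^H, growing from roughly 2,100 to roughly 11,500, which is consistent with citations spreading onto newly emerging nodes.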

The citation-derived ontology is operationalized as the domain layer of a workflow memory system for LLM-assisted legal analysis, connecting to the ontology-controlled paradigm. The extraction pipeline, analysis code, and aggregated statistics are released as open data.