DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models
Abstract: A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches at most 65% and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings).
We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances of defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations.
Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose.
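To make the three checks concrete, here is a minimal Python sketch over a toy rule language. It is not the DeFAb verifier: the rule encoding (`body`, `head`, `priority`), the priority-based conflict resolution, the absence of rule chaining, and the single-rule-removal test for minimality are all assumptions chosen for brevity, and the penguin example is made up for illustration.

```python
# Toy model (assumed, not DeFAb's format): a rule is (body, head, priority);
# a literal is a string, and "~p" is the complement of "p".
# Rule bodies are matched against the facts only (no chaining).

def neg(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def conclusions(facts, rules):
    applicable = [(head, prio) for body, head, prio in rules if body <= facts]
    derived = set(facts)
    for head, prio in applicable:
        # A head is concluded only if it outranks every applicable rule
        # for its complement.
        if all(p < prio for h, p in applicable if h == neg(head)):
            derived.add(head)
    return derived

def derives(facts, theory, hypothesis, anomaly):
    # Valid derivation: the revised theory concludes the anomaly.
    return anomaly in conclusions(facts, theory + hypothesis)

def conservative(facts, theory, hypothesis, anomaly):
    # Conservativity: every earlier conclusion survives, except the one
    # default that the anomaly directly contradicts.
    before = conclusions(facts, theory) - {neg(anomaly)}
    return before <= conclusions(facts, theory + hypothesis)

def minimal(facts, theory, hypothesis, anomaly):
    # Simplified minimality: dropping any single rule must break the hypothesis.
    for i in range(len(hypothesis)):
        smaller = hypothesis[:i] + hypothesis[i + 1:]
        if derives(facts, theory, smaller, anomaly) and conservative(
            facts, theory, smaller, anomaly
        ):
            return False
    return True

def verify(facts, theory, hypothesis, anomaly):
    args = (facts, theory, hypothesis, anomaly)
    return derives(*args) and conservative(*args) and minimal(*args)

facts = {"penguin", "bird"}
theory = [
    (frozenset({"bird"}), "flies", 1),
    (frozenset({"bird"}), "has_feathers", 1),
]
anomaly = "~flies"

good = [(frozenset({"penguin"}), "~flies", 2)]
destructive = good + [(frozenset({"bird"}), "~has_feathers", 2)]
padded = good + [(frozenset({"penguin"}), "swims", 2)]

print(verify(facts, theory, good, anomaly))         # accepted
print(verify(facts, theory, destructive, anomaly))  # rejected: not conservative
print(verify(facts, theory, padded, anomaly))       # rejected: not minimal
```

Each check runs one or two passes over the rule set, and the minimality loop repeats them once per hypothesis rule, so the whole sketch stays polynomial in the size of the theory and hypothesis.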
The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards.
Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap.
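The abstract defines rendering-robust evaluation only as the worst case over four surface renderings, so the exact aggregation is an assumption here. The sketch below shows two plausible readings: a per-instance worst case, where an instance counts only if it is answered correctly under every rendering, and a per-rendering worst case, the lowest accuracy among the renderings. The function names and the correctness flags are made up for illustration.

```python
def per_instance_robust_accuracy(results):
    """results maps an instance id to a list of per-rendering correctness flags."""
    # An instance is credited only if every rendering was answered correctly.
    return sum(all(flags) for flags in results.values()) / len(results)

def worst_rendering_accuracy(results):
    # Transpose to one column of flags per rendering, then take the lowest accuracy.
    columns = zip(*results.values())
    return min(sum(col) / len(col) for col in columns)

# Made-up flags for four instances under four renderings.
results = {
    "a": [True, True, True, True],
    "b": [True, False, True, True],
    "c": [False, False, False, False],
    "d": [True, True, False, True],
}

print(per_instance_robust_accuracy(results))  # 0.25
print(worst_rendering_accuracy(results))      # 0.5
```

The two readings can differ substantially on the same data, which is why a reported rendering-robust figure should state which aggregation it uses.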
We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, checked by a judge-free verifier; a pilot finds zero novel concepts).
The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under the MIT license at this https URL.
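As a rough illustration of how a pass/fail verifier can serve as a reward, the sketch below turns verifier outcomes into a binary reward and into chosen/rejected pairs of the kind used for preference optimization. Everything here is assumed: `toy_verify` is a hypothetical stand-in for the real verifier, the candidate strings are made up, and the `prompt`/`chosen`/`rejected` field names follow a common convention, not a format confirmed by the paper.

```python
from itertools import product

def exact_reward(candidate, verify):
    # Binary reward: 1.0 if the verifier accepts the hypothesis, else 0.0.
    return 1.0 if verify(candidate) else 0.0

def preference_pairs(prompt, candidates, verify):
    # Pair every verified candidate with every rejected one.
    passed = [c for c in candidates if verify(c)]
    failed = [c for c in candidates if not verify(c)]
    return [
        {"prompt": prompt, "chosen": c, "rejected": r}
        for c, r in product(passed, failed)
    ]

# Hypothetical stand-in verifier: accepts exactly one hand-picked hypothesis.
def toy_verify(candidate):
    return candidate == "penguin -> ~flies"

candidates = ["penguin -> ~flies", "bird -> ~flies", "bird -> ~has_feathers"]
pairs = preference_pairs("Explain why this penguin does not fly.", candidates, toy_verify)

print([exact_reward(c, toy_verify) for c in candidates])  # [1.0, 0.0, 0.0]
print(len(pairs))                                         # 2
```

Because the reward comes from a deterministic check instead of a learned judge, identical candidates always receive identical rewards.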