The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling
Abstract: The Cognitive Categorical Transformer (CCT) is a 306M-parameter architecture that augments a pretrained GPT-2 Small backbone with cognitively grounded components that draw on category theory and cognitive science.
Under a matched-step protocol (215,000 optimizer steps, matched data, matched optimizer and schedule) on WikiText-103, CCT reaches 21.27 validation perplexity, compared with 24.19 for an identically fine-tuned GPT-2 Small baseline. The architecture therefore contributes a 2.92 PPL (12% relative) reduction beyond what in-domain fine-tuning alone provides.
A retrain-from-scratch ablation that keeps GT-Full simplicial message passing bypassed throughout the seven-phase activation schedule reaches 23.72 PPL, which localizes 84% of the architectural improvement (2.45 of 2.92 PPL) to GT-Full. We present the first ablation-validated evidence that simplicial message passing improves language-model perplexity at the 306M-parameter scale on WikiText-103.
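The attribution arithmetic above can be checked with a short sketch. The three perplexity values are the ones reported in this abstract; the function and variable names are illustrative assumptions, not part of the CCT codebase.

```python
# Validation perplexities on WikiText-103, as reported in the abstract.
BASELINE_PPL = 24.19     # identically fine-tuned GPT-2 Small
CCT_PPL = 21.27          # full CCT
NO_GT_FULL_PPL = 23.72   # CCT retrained from scratch with GT-Full bypassed


def ppl_summary(baseline, full, ablated):
    """Return (absolute gain, relative gain, share attributed to the ablated part).

    Hypothetical helper for illustration only.
    """
    total_gain = baseline - full            # improvement of the full model over the baseline
    relative_gain = total_gain / baseline   # the same gain as a fraction of the baseline
    component_gain = ablated - full         # perplexity lost when the component is bypassed
    share = component_gain / total_gain     # fraction of the total gain localized to it
    return total_gain, relative_gain, share


total, relative, share = ppl_summary(BASELINE_PPL, CCT_PPL, NO_GT_FULL_PPL)
print(f"total gain: {total:.2f} PPL ({relative:.0%} relative)")
print(f"GT-Full share of the gain: {share:.0%}")
```

Note that the share is measured by what is lost when GT-Full is bypassed (23.72 - 21.27), divided by the full architectural gain (24.19 - 21.27).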
The published GPT-2 Large reaches 22.05 zero-shot PPL on WikiText-103 with 6.2x as many parameters as GPT-2 Small; this paper treats that number as an external published reference, not as the architectural benchmark.
Three negative results on consistency-style categorical priors (sheaf smoothing, adjunction round-trip, curvature regularization), together with the joint structural-prior result for GT-Full and PrecisionWeightedPP, support an empirical pattern we term the structure/consistency distinction: categorical priors that add new topology improve language modeling, whereas those that enforce a consistency identity do not.