The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling

Abstract: The Cognitive Categorical Transformer (CCT) is a 306M-parameter architecture that augments a pretrained GPT-2 Small backbone with components derived from category theory and inspired by cognitive science.

Under a matched-step protocol (215,000 optimizer steps, matched data, matched optimizer and schedule) on WikiText-103, CCT reaches 21.27 validation perplexity, compared with 24.19 for an identically fine-tuned GPT-2 Small baseline. The architecture therefore contributes a 2.92 PPL (12% relative) reduction beyond what in-domain fine-tuning alone provides.
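
The headline comparison above is simple arithmetic on the two reported perplexities. A minimal sketch that reproduces it is given here; the function and variable names are illustrative assumptions, and only the two perplexity values come from the text.

```python
# Validation perplexities on WikiText-103 as quoted in the abstract.
baseline_ppl = 24.19  # identically fine-tuned GPT-2 Small baseline
cct_ppl = 21.27       # CCT under the matched-step protocol


def ppl_reduction(baseline: float, model: float) -> tuple[float, float]:
    """Return (absolute, relative) perplexity reduction of model vs. baseline.

    Hypothetical helper for illustration; not part of any CCT codebase.
    """
    absolute = baseline - model
    relative = absolute / baseline
    return absolute, relative


absolute, relative = ppl_reduction(baseline_ppl, cct_ppl)
print(f"absolute reduction: {absolute:.2f} PPL")
print(f"relative reduction: {relative:.1%}")
```

The relative figure is taken with respect to the fine-tuned baseline, which is the convention the 12% value implies.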

A retrain-from-scratch ablation that keeps GT-Full simplicial message passing bypassed throughout the seven-phase activation schedule reaches 23.72 PPL. This localizes 84% of the architectural improvement (2.45 of 2.92 PPL) to GT-Full. We present the first ablation-validated evidence that simplicial message passing improves language-model perplexity at the 306M-parameter scale on WikiText-103.
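
The 84% attribution follows from the three perplexities quoted in this abstract. The sketch below shows one way to compute it, assuming the share is defined as the gap closed by GT-Full divided by the total gap between the baseline and the full model; the variable names are hypothetical.

```python
# Validation perplexities on WikiText-103 as quoted in the abstract.
baseline_ppl = 24.19  # fine-tuned GPT-2 Small baseline
ablated_ppl = 23.72   # CCT retrained with GT-Full bypassed
full_ppl = 21.27      # full CCT

# Total architectural improvement over the baseline.
total_gain = baseline_ppl - full_ppl

# Improvement lost when GT-Full is bypassed, i.e. the part attributed to it.
gt_full_gain = ablated_ppl - full_ppl

# Improvement that remains without GT-Full (all other components together).
residual_gain = baseline_ppl - ablated_ppl

# Assumed definition: share of the total gain attributed to GT-Full.
gt_full_share = gt_full_gain / total_gain

print(f"total gain:    {total_gain:.2f} PPL")
print(f"GT-Full gain:  {gt_full_gain:.2f} PPL ({gt_full_share:.0%} of total)")
print(f"residual gain: {residual_gain:.2f} PPL")
```

Under this definition the remaining components account for the 0.47 PPL that separates the ablated model from the baseline.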

The published GPT-2 Large model reaches 22.05 zero-shot PPL on WikiText-103 with 6.2x as many parameters as GPT-2 Small; this paper treats that number as an external published reference, not as the architectural benchmark.

Three negative results on consistency-style categorical priors (sheaf smoothing, adjunction round-trip, curvature regularization), together with the joint structural-prior result for GT-Full and PrecisionWeightedPP, support an empirical pattern termed the structure/consistency distinction: categorical priors that add new topology improve language modeling, whereas those that enforce a consistency identity do not.
