The Wiola Architecture for Efficient Small Language Models

We present Wiola, a fully original Small Language Model (SLM) architecture built from first principles. It shares no structural lineage with any existing model family, including GPT, LLaMA, Mistral, and Falcon.

Wiola introduces five independently novel components: (i) Spiral Rotary Positional Encoding (SRPE), which embeds token positions on a three-dimensional helical manifold that combines absolute, relative, and hierarchical positional signals; (ii) Gated Cross-Layer Attention (GCLA), which gives each decoder layer soft cross-attention access to compressed summaries of the two preceding layers to improve inter-layer coherence; (iii) Adaptive Token Merging (ATM), which dynamically merges semantically redundant adjacent tokens in the middle layers of the network to reduce attention complexity without information loss; (iv) Dual Stream Feed-Forward (DSFF), which replaces the conventional MLP with two parallel streams fused by a learned per-dimension gate; and (v) WiolaRMSNorm, a modified normalisation that adds a learned per-dimension offset vector to prevent representation collapse.
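As a rough illustration of components (iv) and (v), the sketch below shows one plausible reading of a per-dimension gated fusion and of an RMS normalisation with a learned offset. The abstract does not give the exact formulas, so everything here is an assumption: the function names `rms_norm_with_offset` and `gated_fusion` are hypothetical, the offset is assumed to be added after scaling, and the gate is assumed to be a sigmoid that blends the two stream outputs.

```python
import math

def rms_norm_with_offset(x, gain, offset, eps=1e-6):
    """RMS-normalise x, then apply a per-dimension gain and offset.

    Assumed form: y_i = gain_i * x_i / rms(x) + offset_i.
    The source only says a learned per-dimension offset is introduced.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * (v / rms) + b for v, g, b in zip(x, gain, offset)]

def gated_fusion(stream_a, stream_b, gate_logits):
    """Blend two stream outputs with a per-dimension sigmoid gate.

    Assumed form: out_i = g_i * a_i + (1 - g_i) * b_i, g_i = sigmoid(z_i).
    """
    gates = [1.0 / (1.0 + math.exp(-z)) for z in gate_logits]
    return [g * a + (1.0 - g) * b for g, a, b in zip(gates, stream_a, stream_b)]

# Toy inputs: two dimensions of x are exactly zero.
x = [3.0, -4.0, 0.0, 0.0]
normed = rms_norm_with_offset(x, gain=[1.0] * 4, offset=[0.0, 0.0, 0.1, 0.1])

# A gate logit of 0 averages the streams; a large logit selects stream_a.
fused = gated_fusion([1.0, 1.0], [3.0, 3.0], gate_logits=[0.0, 20.0])
```

In this toy setting the offset keeps the zero-valued dimensions away from zero after normalisation, which is one way a learned offset could counteract collapsed representations. The gate interpolates between the two streams independently in each dimension.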

We provide complete mathematical derivations, architectural block diagrams, complexity analyses, and systematic comparisons against GPT-2, LLaMA-2, and Mistral.

Wiola is released in four sizes (120M, 360M, 700M, and 1.5B parameters) and is fully compatible with the HuggingFace Transformers ecosystem, with all 22 architectural unit tests passing.