A Modular Architecture for Typologically Controlled Lexicon Generation

Abstract: Constructing artificial lexicons that are pronounceable, typologically plausible, and semantically structured remains an open challenge in computational linguistics. Existing conlang generators either lack formal phonotactic guarantees or delegate generation to opaque, non-reproducible LLM-based pipelines.

We propose a modular framework that samples phoneme inventories from PHOIBLE and generates word forms under three interchangeable phonological grammars: deterministic, Optimality Theory (OT), and Maximum Entropy (MaxEnt). It then assigns meanings via a Swadesh-Leipzig-Jakarta ontology with explicit form-meaning alignment.
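To make the probabilistic grammar component concrete, the following is a minimal sketch of MaxEnt-style form generation, in which each candidate form receives probability proportional to the exponential of its negative weighted constraint violations. The toy inventory, the two constraints, their weights, and all function names are illustrative assumptions, not the framework's actual inventory sampling or constraint set.

```python
import itertools
import math
import random

# Toy inventory (assumption; the framework samples inventories from PHOIBLE).
CONSONANTS = ["p", "t", "k", "m", "n", "s"]
VOWELS = ["a", "i", "u"]

# Hypothetical constraint weights: (*CC, ONSET).
WEIGHTS = (3.0, 1.0)


def violations(form):
    """Count violations of two illustrative constraints for a tuple of segments."""
    is_vowel = [seg in VOWELS for seg in form]
    # *CC: one violation per pair of adjacent consonants.
    cc = sum(1 for a, b in zip(is_vowel, is_vowel[1:]) if not a and not b)
    # ONSET: one violation if the form begins with a vowel.
    onset = 1 if is_vowel[0] else 0
    return (cc, onset)


def maxent_distribution(length):
    """Return all forms of the given length and their MaxEnt probabilities."""
    forms = list(itertools.product(CONSONANTS + VOWELS, repeat=length))
    # Harmony is the weighted sum of violations; P(form) is proportional to exp(-harmony).
    scores = [
        math.exp(-sum(w * v for w, v in zip(WEIGHTS, violations(f))))
        for f in forms
    ]
    z = sum(scores)
    return forms, [s / z for s in scores]


def sample_forms(n, length=4, seed=0):
    """Draw n forms from the MaxEnt distribution; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    forms, probs = maxent_distribution(length)
    return ["".join(f) for f in rng.choices(forms, weights=probs, k=n)]
```

Because the candidate space is enumerated exhaustively, this sketch only scales to short forms; a practical generator would need a more efficient sampling strategy.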

We evaluate generated lexicons of 100 to 5,000 forms using character $n$-gram perplexity, log-likelihood, and KL divergence against PHOIBLE. Probabilistic grammars consistently outperform deterministic and random baselines on both phonotactic coherence and typological realism.
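As a rough illustration of two of these metrics, the sketch below computes a KL divergence between segment frequency distributions and a character bigram perplexity with add-one smoothing. It assumes each character is one segment and that the reference distribution is supplied as a plain dictionary; how the reference is derived from PHOIBLE, the $n$-gram order, and the smoothing scheme are all assumptions here, not details taken from the evaluation itself.

```python
import math
from collections import Counter


def segment_distribution(lexicon):
    """Relative frequency of each character (treated as one segment) in a lexicon."""
    counts = Counter(ch for word in lexicon for ch in word)
    total = sum(counts.values())
    return {seg: c / total for seg, c in counts.items()}


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in nats; q is floored at eps so unseen segments stay finite."""
    return sum(
        pv * math.log(pv / max(q.get(seg, 0.0), eps))
        for seg, pv in p.items()
        if pv > 0
    )


def bigram_perplexity(train, test):
    """Per-character perplexity of test under an add-one-smoothed bigram model fit on train."""
    def pad(word):
        return "#" + word + "#"  # '#' marks word boundaries

    # The smoothing vocabulary covers test characters too, so every bigram has nonzero mass.
    vocab = {ch for word in train + test for ch in pad(word)}
    bigrams, contexts = Counter(), Counter()
    for word in train:
        s = pad(word)
        for a, b in zip(s, s[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1

    log_prob, n = 0.0, 0
    for word in test:
        s = pad(word)
        for a, b in zip(s, s[1:]):
            log_prob += math.log((bigrams[(a, b)] + 1) / (contexts[a] + len(vocab)))
            n += 1
    return math.exp(-log_prob / n)
```

Lower values on both measures indicate a lexicon whose segment statistics sit closer to the reference and whose forms are more predictable under the fitted model.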
