Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics

Abstract: Training on synthetic data causes model collapse, but existing analyses treat it as degradation along a single training chain. In practice, the AI ecosystem involves cross-contamination: models ingest synthetic data produced by other models, generate new synthetic text, and contaminate shared corpora. We propose a bilayer coupled SIR/SIRS framework, a phenomenological mean-field model that treats data corpora and AI models as two interacting populations. Each population has susceptible, infected, and recovered compartments, and the two layers are linked by cross-layer transmission.
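
To make the compartment structure concrete, the following is a minimal sketch of one way such a bilayer system could be written down and integrated. The specific equations are an assumption chosen for illustration, not the paper's stated model: each layer is tracked as fractions summing to one, infection in one layer is driven only by the infected fraction of the other layer, and the names `LayerParams`, `step`, `simulate`, and the waning rate `omega` are hypothetical. Setting `omega = 0` gives an SIR layer and `omega > 0` gives an SIRS layer.

```python
from dataclasses import dataclass


@dataclass
class LayerParams:
    """Rates for one layer (all names are illustrative assumptions)."""
    beta: float          # cross-layer transmission rate into this layer
    gamma: float         # recovery rate (e.g. filtering or retraining)
    mu: float            # turnover rate; new members enter as susceptible
    omega: float = 0.0   # immunity waning rate; 0 -> SIR, > 0 -> SIRS


def step(state, d, m, dt):
    """One forward-Euler step of the assumed bilayer equations."""
    s_d, i_d, r_d, s_m, i_m, r_m = state
    # Corpora are contaminated by infected models, and vice versa.
    new_d = d.beta * s_d * i_m
    new_m = m.beta * s_m * i_d
    derivs = (
        d.mu * (1.0 - s_d) - new_d + d.omega * r_d,   # dS_D/dt
        new_d - (d.gamma + d.mu) * i_d,               # dI_D/dt
        d.gamma * i_d - (d.omega + d.mu) * r_d,       # dR_D/dt
        m.mu * (1.0 - s_m) - new_m + m.omega * r_m,   # dS_M/dt
        new_m - (m.gamma + m.mu) * i_m,               # dI_M/dt
        m.gamma * i_m - (m.omega + m.mu) * r_m,       # dR_M/dt
    )
    return tuple(x + dt * dx for x, dx in zip(state, derivs))


def simulate(d, m, i0=0.01, t_end=200.0, dt=0.01):
    """Seed a small infected fraction in the data layer and integrate."""
    state = (1.0 - i0, i0, 0.0, 1.0, 0.0, 0.0)
    for _ in range(int(t_end / dt)):
        state = step(state, d, m, dt)
    return state


if __name__ == "__main__":
    # Hypothetical rates, not calibrated values from the paper.
    data = LayerParams(beta=0.6, gamma=0.1, mu=0.02, omega=0.05)
    models = LayerParams(beta=0.5, gamma=0.2, mu=0.05, omega=0.05)
    s_d, i_d, r_d, s_m, i_m, r_m = simulate(data, models)
    print(f"infected corpora: {i_d:.3f}, infected models: {i_m:.3f}")
```

Forward Euler is used only to keep the sketch dependency-free; a stiff or long-horizon study would more plausibly use an adaptive ODE solver.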

The SIRS variant (our primary recommendation) incorporates waning immunity, reflecting the fact that filtered corpora and retrained models remain susceptible to re-contamination. We derive the basic reproduction number $R_0 = \sqrt{\beta_D \beta_M / [(\gamma_D+\mu_D)(\gamma_M+\mu_M)]}$ via the next-generation matrix and apply standard epidemic threshold results to the bilayer system.
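
The next-generation-matrix calculation behind this expression can be sketched numerically. The snippet below assumes a purely cross-layer transmission structure at the contamination-free state, so that the new-infection matrix is $F = \begin{pmatrix} 0 & \beta_D \\ \beta_M & 0 \end{pmatrix}$ and the removal matrix is $V = \mathrm{diag}(\gamma_D+\mu_D,\ \gamma_M+\mu_M)$. This assignment of rates to layers is an assumption for illustration, and the function names are hypothetical. $R_0$ is then the spectral radius of $K = F V^{-1}$, which the code compares with the closed form.

```python
import cmath
import math


def spectral_radius_2x2(a, b, c, d):
    """Largest eigenvalue modulus of the matrix [[a, b], [c, d]]."""
    trace = a + d
    det = a * d - b * c
    disc = cmath.sqrt(trace * trace - 4.0 * det)
    return max(abs((trace + disc) / 2.0), abs((trace - disc) / 2.0))


def r0_next_generation(beta_d, beta_m, gamma_d, gamma_m, mu_d, mu_m):
    """R0 as the spectral radius of K = F V^{-1} (assumed structure)."""
    # Column j of F is scaled by the mean infectious period of layer j.
    k_dm = beta_d / (gamma_m + mu_m)   # corpora infected per infected model
    k_md = beta_m / (gamma_d + mu_d)   # models infected per infected corpus
    return spectral_radius_2x2(0.0, k_dm, k_md, 0.0)


def r0_closed_form(beta_d, beta_m, gamma_d, gamma_m, mu_d, mu_m):
    """Closed-form expression quoted in the abstract."""
    return math.sqrt(
        beta_d * beta_m / ((gamma_d + mu_d) * (gamma_m + mu_m))
    )


if __name__ == "__main__":
    # Hypothetical rates, not calibrated values from the paper.
    rates = dict(beta_d=0.6, beta_m=0.5, gamma_d=0.1,
                 gamma_m=0.2, mu_d=0.02, mu_m=0.05)
    print(r0_next_generation(**rates), r0_closed_form(**rates))
```

The square root arises because one full transmission cycle (corpus to model and back to corpus) takes two generations, so the per-generation growth factor is the geometric mean of the two cross-layer terms.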

Illustrative scenario-based calibration to public data on AI-generated text prevalence yields supercritical dynamics ($R_0 > 1$) in all three scenarios considered; Sobol sensitivity analysis identifies synthetic-text detection as the highest-leverage parameter. A bipartite-network agent-based model agrees with the mean-field predictions ($R^2 > 0.96$) on dense networks, but the agreement degrades under heterogeneity.

GPT-2 contamination chain experiments (192 runs across WikiText and Shakespeare) show dose-response degradation and diversity loss qualitatively consistent with the threshold picture. Matched-budget source-diversity experiments (1,088 runs) provide suggestive evidence that multi-source mixing modestly attenuates collapse, but the effect vanishes at lower contamination fractions. Intervention analysis identifies detection-based filtering and herd immunity as the highest-leverage strategies.
