A Filtered Mixture-of-Generators for Fully Synthetic Survival Training

Abstract: Survival analysis models time-to-event data, but in clinical settings training data are costly and scarce: events accrue over years of follow-up, cohorts are small, and privacy regulations restrict sharing across institutions. Tabular generative models promise augmentation and privacy-preserving cohort sharing, yet they are themselves data-hungry. On the small cohorts typical of survival analysis, a single generator rarely characterizes the population well enough for downstream models trained on its output to match real-data performance.

FoGS (Filtered Mixture-of-Generators for Survival analysis) reframes synthetic-data construction as sample selection rather than generation. A candidate pool is drawn from four architecturally distinct tabular generators, and each sample is scored by an ensemble of seven survival models trained on real data, using proper scoring rules as a per-sample plausibility proxy. A two-level pipeline optimizes, in its outer loop, a selection policy (generator quotas, scorer weights, a random complement, and stratified balancing on event time and censoring) against held-out downstream performance, while an inner loop tunes the downstream model (XGBoost-Cox).
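
To make the selection step concrete, the following is a minimal sketch of how one fixed policy might be applied to a mixed candidate pool. It is an illustration under stated assumptions, not the FoGS implementation: the function names, the dictionary layout of a candidate, and the convention that higher scores mean "more plausible" are all hypothetical, and the stratified balancing on event time and censoring is omitted for brevity. In the full method the policy parameters would themselves be tuned by the outer loop.

```python
import random


def weighted_score(scores, weights):
    """Combine per-scorer scores into one value (assumed: higher = more plausible)."""
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total


def select_samples(pool, quotas, weights, random_frac, n_select, seed=0):
    """Apply one selection policy to a candidate pool (hypothetical helper).

    pool:        list of dicts with keys "generator" and "scores"
    quotas:      dict mapping generator name -> share of the filtered budget
    weights:     one non-negative weight per scorer
    random_frac: share of n_select reserved for uniform random draws
    """
    rng = random.Random(seed)
    n_random = int(round(random_frac * n_select))
    n_filtered = n_select - n_random

    chosen = []  # indices into pool
    for gen, frac in quotas.items():
        # Rank this generator's candidates by the weighted ensemble score.
        idx = [i for i, c in enumerate(pool) if c["generator"] == gen]
        idx.sort(
            key=lambda i: weighted_score(pool[i]["scores"], weights),
            reverse=True,
        )
        # Keep the top-scoring candidates up to this generator's quota.
        chosen.extend(idx[: int(frac * n_filtered)])

    # Random complement: fill the remaining budget uniformly from the rest,
    # which also absorbs any slots lost to rounding the quotas down.
    taken = set(chosen)
    rest = [i for i in range(len(pool)) if i not in taken]
    chosen.extend(rng.sample(rest, min(n_select - len(chosen), len(rest))))
    return [pool[i] for i in chosen]
```

An outer loop in this spirit would call `select_samples` with different `quotas`, `weights`, and `random_frac` values, train the downstream model on each selection, and keep the policy with the best held-out performance.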

On 16 public datasets under train-on-synthetic, test-on-real evaluation (C-index and IBS, both on a $0$–$100$ scale), FoGS yields mean improvements of $+2.17$ in C-index and $+0.67$ in IBS, improving both metrics on 9 of 16 datasets and at least one on 13 (one-sided Wilcoxon $p=0.039$ and $p=0.035$, respectively). It matches or exceeds real-data training on most cohorts, with no significant change in nearest-neighbour privacy margin relative to unfiltered sampling. Sample filtering over a heterogeneous generator pool is thus a viable substitute for real-data training in privacy-restricted clinical settings.
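
For readers unfamiliar with the significance test quoted above, a one-sided Wilcoxon signed-rank test on per-dataset metric differences can be sketched as follows. This is an illustrative exact computation under simplifying assumptions (no zero differences, no tied absolute values); the function name is hypothetical, and the abstract does not state whether the reported p-values come from an exact or an approximate variant.

```python
def wilcoxon_one_sided(diffs):
    """Exact one-sided signed-rank test for H1: differences tend to be > 0.

    Assumes no zero differences and no ties among absolute values.
    Returns (W+, p-value), where W+ is the sum of ranks of positive differences.
    """
    n = len(diffs)
    # Rank the differences by absolute value (rank 1 = smallest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)

    # Under H0 every sign pattern is equally likely; count, for each possible
    # rank sum s, how many subsets of {1..n} add up to s.
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]

    # p-value: probability of a rank sum at least as large as the observed one.
    return w_plus, sum(counts[w_plus:]) / 2 ** n
```

Here `diffs` would hold one paired difference per dataset (filtered minus unfiltered, signed so that positive values favour filtering); a small p-value indicates that improvements outweigh degradations more than chance would suggest.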
