Why your synthetic fintech data fails code review (and how mixture models fix it)

Every fintech developer has done this: you need test data, you reach for Faker, you generate ten thousand transactions, and your demo works. Then a data scientist on the buying side opens your dataset, runs a single df.describe(), and asks the deal-killing question: “Why are your transaction amounts uniformly distributed?”

Real financial data has a shape. Synthetic data that ignores that shape is instantly recognizable, and in testing, ML training, or sales demos it is instantly discrediting. I spent nine years running a savings app in Latin America (30,000+ users, 2015–2024), and when it wound down I kept something most synthetic data generators never had: 506,311 real records to measure that shape against. This post covers the three statistical properties that separate believable synthetic financial data from Faker output, with the actual numbers.

Property 1: Amounts are multimodal, not lognormal

The standard “sophisticated” approach is to sample amounts from a lognormal distribution. It’s better than uniform, and it still fails. When I fitted a single lognormal to 261,070 real deposits, the body of the distribution looked fine (fitted percentiles deviated from the observed ones by 7–10% between p25 and p90), but the tail fell apart: 35–45% deviation at p95–p99.

The reason is that “deposit amount” isn’t one population. It’s at least three: micro-deposits (the $1–$20 spare-change crowd), typical deposits ($100–$800), and large transfers ($6,000+). Each has its own location and spread. A single lognormal averages across them and gets all of them wrong.
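The percentile check described above can be sketched roughly as follows. This is a minimal illustration, not the analysis run on the real records: the three populations, their weights, and their spreads are made-up toy parameters, and the helper name single_lognormal_deviation is hypothetical. It fits one lognormal (a normal fit on the log-amounts) and reports how far each fitted percentile lands from the empirical one, with a genuinely lognormal sample as a control.

```python
import numpy as np
from scipy.stats import norm

def single_lognormal_deviation(amounts, percentiles=(25, 50, 75, 90, 95, 99)):
    """Relative gap between one fitted lognormal and the empirical percentiles."""
    log_a = np.log(np.asarray(amounts, dtype=float))
    # Fitting a lognormal is the same as fitting a normal to the log-amounts.
    mu, sigma = log_a.mean(), log_a.std(ddof=1)
    deviations = {}
    for p in percentiles:
        empirical = np.percentile(amounts, p)
        fitted = np.exp(mu + sigma * norm.ppf(p / 100.0))
        deviations[p] = abs(fitted - empirical) / empirical
    return deviations

rng = np.random.default_rng(42)

# Toy stand-in for real deposits: three invented populations (assumed values).
toy_deposits = np.concatenate([
    rng.lognormal(mean=np.log(5), sigma=0.6, size=6_000),      # micro-deposits
    rng.lognormal(mean=np.log(300), sigma=0.5, size=12_000),   # typical deposits
    rng.lognormal(mean=np.log(8_000), sigma=0.4, size=2_000),  # large transfers
])
# Control sample that really is a single lognormal.
pure_lognormal = rng.lognormal(mean=np.log(300), sigma=1.0, size=20_000)

toy_dev = single_lognormal_deviation(toy_deposits)
pure_dev = single_lognormal_deviation(pure_lognormal)
for p in toy_dev:
    print(f"p{p}: mixed populations {toy_dev[p]:.1%}, pure lognormal {pure_dev[p]:.1%}")
```

On the control sample the gaps should stay small at every percentile, while the mixed sample shows large gaps. Exactly where the toy data breaks depends on the invented parameters, so it will not reproduce the body-versus-tail pattern reported for the real deposits.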

The fix is a mixture of lognormals. Fit GaussianMixture from scikit-learn on the log-amounts, select the number of components K, then sample from the fitted mixture and exponentiate the samples to get amounts back. One non-obvious lesson from doing this on real data: don’t select K with BIC. Financial amounts have heavy atoms at round values (more on that below), and BIC reacts to those atoms by under-fitting the number of components. Selecting K by minimizing the Kolmogorov–Smirnov statistic against a held-out sample worked far better: a 6-component mixture brought deposits from KS=0.068 down to KS=0.032, and p99 deviation from ~45% to under 5%.
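Here is one way that selection loop might look. Treat it as a sketch under assumptions rather than the pipeline used on the real data: the toy deposits, the 20% hold-out split, the K range of 1 to 8, and the helper names mixture_cdf, ks_statistic, and select_k_by_ks are all illustrative choices. The KS distance is computed by hand against the mixture's analytic CDF on the log scale.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def mixture_cdf(gmm, x):
    """CDF of a fitted 1-D Gaussian mixture, evaluated at log-amounts x."""
    means = gmm.means_.ravel()
    sds = np.sqrt(gmm.covariances_.ravel())  # 1-D data: each covariance is a scalar
    z = (np.asarray(x, dtype=float)[:, None] - means) / sds
    return norm.cdf(z) @ gmm.weights_

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov distance between a sample and a model CDF."""
    x = np.sort(sample)
    n = x.size
    f = cdf(x)
    return float(max(np.max(np.arange(1, n + 1) / n - f),
                     np.max(f - np.arange(n) / n)))

def select_k_by_ks(amounts, k_values=range(1, 9), holdout_frac=0.2, seed=0):
    """Fit mixtures on a training split; keep the K with the lowest held-out KS."""
    rng = np.random.default_rng(seed)
    log_a = np.log(np.asarray(amounts, dtype=float))
    idx = rng.permutation(log_a.size)
    n_hold = int(holdout_frac * log_a.size)
    hold, train = log_a[idx[:n_hold]], log_a[idx[n_hold:]]
    best = None
    for k in k_values:
        gmm = GaussianMixture(n_components=k, n_init=3, random_state=seed)
        gmm.fit(train.reshape(-1, 1))
        ks = ks_statistic(hold, lambda v, g=gmm: mixture_cdf(g, v))
        if best is None or ks < best[1]:
            best = (k, ks, gmm)
    return best

rng = np.random.default_rng(7)
# Toy stand-in for real deposits: three invented populations (assumed values).
toy_deposits = np.concatenate([
    rng.lognormal(mean=np.log(5), sigma=0.6, size=6_000),
    rng.lognormal(mean=np.log(300), sigma=0.5, size=12_000),
    rng.lognormal(mean=np.log(8_000), sigma=0.4, size=2_000),
])

k, ks, gmm = select_k_by_ks(toy_deposits)
log_samples, _ = gmm.sample(10_000)              # samples are log-amounts
synthetic_amounts = np.exp(log_samples.ravel())  # back to currency units
print(f"selected K={k}, held-out KS={ks:.4f}")
```

The toy data has no atoms at round values, so it says nothing about how BIC behaves on real amounts; it only shows the mechanics of scoring each K on a held-out split and sampling from the winner.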