Why your synthetic fintech data fails code review (and how mixture models fix it)
Every fintech developer has done this: you need test data, you reach for Faker, you generate ten thousand transactions, and your demo works. Then a data scientist on the buying side opens your dataset, runs one df.describe(), and the deal-killing question arrives: “Why are your transaction amounts uniformly distributed?”
Real financial data has a shape. Synthetic data that ignores that shape is instantly recognizable, and in testing, ML training, or sales demos it is instantly discrediting. I spent nine years running a savings app in Latin America (30,000+ users, 2015–2024), and when it wound down I kept something most synthetic data generators never had: 506,311 real records to measure that shape against. This post covers the three statistical properties that separate believable synthetic financial data from Faker output, with the actual numbers.
Property 1: Amounts are multimodal, not lognormal
The standard “sophisticated” approach is to sample amounts from a lognormal distribution. It’s better than uniform, and it still fails. When I fitted a single lognormal to 261,070 real deposits, the body of the distribution looked fine (7–10% deviation between p25 and p90), but the tail fell apart: 35–45% deviation at p95–p99.
The reason is that “deposit amount” isn’t one population. It’s at least three: micro-deposits (the $1–$20 spare-change crowd), typical deposits ($100–$800), and large transfers ($6,000+). Each has its own location and spread. A single lognormal averages across them and gets all of them wrong.
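If you want to run the same kind of check on your own data, here is a minimal sketch of how the per-percentile deviation of a single lognormal fit could be measured. It is not the code behind the numbers above: the function names `single_lognormal_deviation` and `toy_deposits` are hypothetical, and the toy data is an invented three-population mix (weights, medians, and spreads are assumptions chosen only to mimic the micro / typical / large split), so the deviations it prints will not match the figures quoted in this post.

```python
import numpy as np
from scipy.stats import norm

PERCENTILES = (25, 50, 75, 90, 95, 99)

def single_lognormal_deviation(amounts, percentiles=PERCENTILES):
    """Fit one lognormal by maximum likelihood and return the relative
    deviation between fitted and empirical quantiles, per percentile."""
    amounts = np.asarray(amounts, dtype=float)
    log_amounts = np.log(amounts)
    # MLE for a lognormal: mean and std of the log-values.
    mu, sigma = log_amounts.mean(), log_amounts.std()
    empirical = np.percentile(amounts, percentiles)
    # Analytic quantiles of the fitted lognormal: exp(mu + sigma * z_p).
    fitted = np.exp(mu + sigma * norm.ppf(np.array(percentiles) / 100.0))
    deviation = np.abs(fitted - empirical) / empirical
    return dict(zip(percentiles, deviation))

def toy_deposits(n=20_000, seed=0):
    """Invented stand-in data: three lognormal populations (assumed values)."""
    rng = np.random.default_rng(seed)
    sizes = rng.multinomial(n, [0.3, 0.6, 0.1])
    parts = [
        rng.lognormal(np.log(5), 0.6, sizes[0]),     # micro-deposits
        rng.lognormal(np.log(300), 0.6, sizes[1]),   # typical deposits
        rng.lognormal(np.log(8000), 0.5, sizes[2]),  # large transfers
    ]
    return np.concatenate(parts)

for p, dev in single_lognormal_deviation(toy_deposits()).items():
    print(f"p{p}: {dev:.1%} deviation")
```

On data that really is a single lognormal, every percentile should come back close to zero; large deviations at some percentiles are the signal that one population is not enough.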
The fix is a mixture of lognormals. Fit GaussianMixture from scikit-learn on the log-amounts, select the number of components, sample from the mixture. One non-obvious lesson from doing this on real data: don’t select K with BIC. Financial amounts have heavy atoms at round values (more on that below), and BIC reacts to those atoms by under-fitting the number of components. Selecting K by minimizing the Kolmogorov–Smirnov statistic against a held-out sample worked far better: a 6-component mixture brought deposits from KS=0.068 down to KS=0.032, and p99 deviation from ~45% to under 5%.
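The procedure described above (fit `GaussianMixture` on log-amounts, pick K by held-out KS, sample) could look roughly like the sketch below. Treat it as one possible reading under stated assumptions, not the exact pipeline used on the real records: the helper names `fit_lognormal_mixture` and `sample_amounts`, the 20% hold-out split, the candidate range for K, and the toy input data are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.mixture import GaussianMixture

def fit_lognormal_mixture(amounts, k_candidates=range(1, 7),
                          holdout_frac=0.2, seed=0):
    """Fit Gaussian mixtures on log-amounts and keep the K whose samples
    give the smallest KS statistic against a held-out slice of the data."""
    rng = np.random.default_rng(seed)
    log_amounts = rng.permutation(np.log(np.asarray(amounts, dtype=float)))
    n_holdout = int(len(log_amounts) * holdout_frac)
    holdout, train = log_amounts[:n_holdout], log_amounts[n_holdout:]

    best_gmm, best_k, best_ks = None, None, np.inf
    for k in k_candidates:
        gmm = GaussianMixture(n_components=k, random_state=seed)
        gmm.fit(train.reshape(-1, 1))
        simulated, _ = gmm.sample(len(holdout))
        # Two-sample KS between held-out real data and mixture samples.
        ks = ks_2samp(holdout, simulated.ravel()).statistic
        if ks < best_ks:
            best_gmm, best_k, best_ks = gmm, k, ks
    return best_gmm, best_k, best_ks

def sample_amounts(gmm, n):
    """Draw n synthetic amounts: sample in log-space, then exponentiate."""
    log_samples, _ = gmm.sample(n)
    return np.exp(log_samples.ravel())

# Invented toy data (assumed weights and scales), not real deposits.
rng = np.random.default_rng(42)
toy = np.concatenate([
    rng.lognormal(np.log(5), 0.6, 3000),     # micro-deposits
    rng.lognormal(np.log(300), 0.6, 6000),   # typical deposits
    rng.lognormal(np.log(8000), 0.5, 1000),  # large transfers
])

gmm, k, ks = fit_lognormal_mixture(toy)
synthetic = sample_amounts(gmm, 10_000)
print(f"selected K={k}, held-out KS={ks:.3f}")
```

The KS statistic is computed on log-amounts here. Because the logarithm is strictly increasing, the two-sample KS statistic is the same as it would be on raw amounts. Note that this sketch does nothing about the atoms at round values; it only covers the mixture fit and the selection of K.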