SQL patterns I use to catch transaction fraud
Quick disclaimer: I do data work on a program-integrity team. The examples below use generic transaction tables and made-up scenarios. Nothing here comes from anything I’ve actually worked on or seen. Views are mine, not my employer’s.
Fraud detection in transaction data is mostly SQL. Not machine learning, not graph databases, not whatever Gartner is hyping this year. SQL, run against the right tables, with the right joins, looking for the right shapes. I work mostly with government-funded benefit programs, but the patterns below port over to anything with a transactions table: credit cards, healthcare claims, e-commerce, point-of-sale. If money moves and gets logged, these queries will find weird things in the log.
Six patterns, roughly in the order I’d build them out on a new dataset.
1. Velocity
The simplest one. Someone with a stolen card wants to drain it before the holder notices, so they hit the card fast.
SELECT cardholder_id,
       date_trunc('hour', timestamp) AS hour_bucket,
       count(*) AS tx_count,
       min(timestamp) AS first_tx,
       max(timestamp) AS last_tx
FROM transactions
WHERE timestamp >= current_date - INTERVAL '30 days'
GROUP BY 1, 2
HAVING count(*) > 10;
Tune two knobs: the window size and the count threshold. I usually run a 1-minute, 5-minute, and 1-hour version in parallel and compare. Different fraud shows up at different scales: a card-testing ring hits a server in seconds; a benefits-trafficking ring might take an afternoon.
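date_trunc only cuts on calendar units, so the 5-minute version needs its own bucketing. A minimal sketch, assuming Postgres and the same transactions table; the 300-second bucket width and the threshold of 5 are placeholders to tune, not recommendations:

```sql
-- Fixed 5-minute buckets: floor the epoch to a multiple of 300 seconds.
-- Bucket width (300) and threshold (5) are illustrative assumptions.
SELECT cardholder_id,
       to_timestamp(floor(extract(epoch FROM timestamp) / 300) * 300) AS bucket_5min,
       count(*) AS tx_count
FROM transactions
WHERE timestamp >= current_date - INTERVAL '30 days'
GROUP BY 1, 2
HAVING count(*) > 5;
```

Fixed buckets can split one burst across a bucket boundary, so a run of six transactions may show up as two buckets of three and slip under the threshold.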
A few cardholders will legitimately blow past the threshold: route operators servicing vending machines, people reloading prepaid cards in bulk. Those are your false positives. Worth keeping a whitelist after the first pass.
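One way to apply that list is an anti-join on the velocity query. This is a sketch only: velocity_allowlist is a hypothetical table holding the cardholder_id values you have already reviewed and cleared.

```sql
-- velocity_allowlist is a hypothetical table of reviewed, known-legitimate cardholders.
SELECT t.cardholder_id,
       date_trunc('hour', t.timestamp) AS hour_bucket,
       count(*) AS tx_count
FROM transactions t
WHERE t.timestamp >= current_date - INTERVAL '30 days'
  AND NOT EXISTS (
        SELECT 1
        FROM velocity_allowlist a
        WHERE a.cardholder_id = t.cardholder_id
      )
GROUP BY 1, 2
HAVING count(*) > 10;
```

NOT EXISTS is used rather than NOT IN so that a NULL cardholder_id in the list cannot silently empty the result.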
For sliding-window velocity, this is the form I use:
SELECT cardholder_id,
       timestamp,
       count(*) OVER (
           PARTITION BY cardholder_id
           ORDER BY timestamp
           RANGE BETWEEN INTERVAL '5 minutes' PRECEDING AND CURRENT ROW
       ) AS tx_in_last_5min
FROM transactions
QUALIFY tx_in_last_5min >= 5
ORDER BY cardholder_id, timestamp;
QUALIFY works in Snowflake, BigQuery, Databricks, and Teradata. Postgres doesn’t have it, so there you wrap the whole thing in a CTE and filter on the outside. Slight pain, same result.
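For reference, the Postgres rewrite might look like the sketch below. It assumes Postgres 11 or later, which is where RANGE frames with an interval offset became available; the CTE name windowed is arbitrary.

```sql
-- Same sliding-window count, with the QUALIFY filter moved to an outer WHERE.
WITH windowed AS (
    SELECT cardholder_id,
           timestamp,
           count(*) OVER (
               PARTITION BY cardholder_id
               ORDER BY timestamp
               RANGE BETWEEN INTERVAL '5 minutes' PRECEDING AND CURRENT ROW
           ) AS tx_in_last_5min
    FROM transactions
)
SELECT cardholder_id, timestamp, tx_in_last_5min
FROM windowed
WHERE tx_in_last_5min >= 5
ORDER BY cardholder_id, timestamp;
```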
2. Impossible travel
If a card swipes in Chicago and seven minutes later swipes in Los Angeles, one of those swipes is fake. The card is cloned. This is the least controversial fraud signal you’ll find: there’s almost no legitimate reason for a single physical card to be in two distant places seven minutes apart.
WITH ordered_tx AS (
    -- Pair each transaction with the same cardholder's previous one
    SELECT cardholder_id, timestamp, location,
           LAG(timestamp) OVER (PARTITION BY cardholder_id ORDER BY timestamp) AS prev_ts,
           LAG(location) OVER (PARTITION BY cardholder_id ORDER BY timestamp) AS prev_loc
    FROM transactions
)
SELECT cardholder_id,
       prev_ts AS first_tx,
       timestamp AS second_tx,
       prev_loc AS first_location,
       location AS second_location,
       EXTRACT(EPOCH FROM (timestamp - prev_ts)) / 60 AS minutes_apart,
       haversine(prev_loc, location) AS miles_apart
FROM ordered_tx
WHERE prev_ts IS NOT NULL
  AND prev_loc <> location
  -- miles / seconds * 3600 = implied mph; nullif avoids divide-by-zero,
  -- which also means same-second pairs evaluate to NULL and drop out
  AND haversine(prev_loc, location) / nullif(EXTRACT(EPOCH FROM (timestamp - prev_ts)), 0) * 3600 > 600;
haversine is the great-circle distance function. Most warehouses ship one. If yours doesn’t, it’s about ten lines to write your own. Check the units on a built-in before trusting the threshold: the query above assumes miles. The 600 mph threshold is rough. Commercial jet cruise is around 575, so this is “faster than a plane could possibly do it.”
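A hand-rolled version could look like the sketch below. Assumptions are loud here: it targets Postgres, the name haversine_miles is made up, and it takes latitude/longitude pairs in decimal degrees rather than the two-argument location form used in the query above, so the call site would need adapting to however your table stores location.

```sql
-- Great-circle distance in miles between two lat/lon points (decimal degrees).
-- haversine_miles is a hypothetical name; 3958.8 is the mean Earth radius in miles.
CREATE OR REPLACE FUNCTION haversine_miles(
    lat1 double precision, lon1 double precision,
    lat2 double precision, lon2 double precision
) RETURNS double precision
LANGUAGE sql IMMUTABLE
AS $$
    SELECT 2 * 3958.8 * asin(sqrt(
        -- least() clamps floating-point overshoot so asin never sees a value above 1
        least(1.0,
            power(sin(radians(lat2 - lat1) / 2), 2)
            + cos(radians(lat1)) * cos(radians(lat2))
              * power(sin(radians(lon2 - lon1) / 2), 2)
        )
    ));
$$;

-- Usage: downtown Chicago to downtown Los Angeles
SELECT haversine_miles(41.8781, -87.6298, 34.0522, -118.2437) AS miles_apart;
```

The usage query should come back a bit over 1,700 miles, which is a quick sanity check that the arguments are in the right order and the units are miles.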
You can tighten it to 100 mph if you want to catch suspiciously fast ground travel too, but at that threshold you start picking up real airline travelers, kids with parents driving them home from camp, that kind of thing.
A few other shapes in the same family are worth running:
- Two distant cities, same state, inside 5 minutes. Local cloning rings.
- Multiple ZIP codes inside an hour. Skimmer rings working a region.
- Border crossings inside 10 minutes. International rings.
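The ZIP-code shape from the list above can be sketched with a plain GROUP BY. This assumes the transactions table carries a zip_code column for the merchant or terminal (a hypothetical column, not one used elsewhere in this post), and the cutoff of 3 distinct ZIPs is an arbitrary starting point:

```sql
-- Cards seen in several ZIP codes within the same clock hour.
-- zip_code is an assumed column; the threshold of 3 is illustrative.
SELECT cardholder_id,
       date_trunc('hour', timestamp) AS hour_bucket,
       count(DISTINCT zip_code) AS distinct_zips,
       count(*) AS tx_count
FROM transactions
WHERE timestamp >= current_date - INTERVAL '30 days'
GROUP BY 1, 2
HAVING count(DISTINCT zip_code) >= 3
ORDER BY distinct_zips DESC;
```

Dense urban areas pack many ZIP codes into a short walk, so expect this one to need a higher threshold or a distance check in cities.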
3. Amount anomalies
Two kinds of amounts show up disproportionately in fraud and almost never in normal use: small round-dollar charges and amounts just under a known limit.
SELECT cardholder_id, timestamp, amount, merchant_id
FROM transactions
WHERE (amount >= 99.50 AND amount < 100.00)
OR (amount >= 499.50 AND amount < 500.00)
OR amount IN (1.00, 5.00, 10.00)
ORDER BY cardholder_id, timestamp;
What’s happening: round-dollar amounts at small values ($1.00, $5.00, $10.00) are almost always card tests. Someone got a card number from a dump and is checking whether it works before reselling it. Real cardholders almost never buy something for exactly $1.00. Coffee is $4.73, gas is $52.81. The roundness is the signal.
Amounts just below a threshold are different. $99.99 is interesting because at a lot of places, $100 is the line where the cashier is supposed to check ID. $499.99 is interesting because $500 is often a daily ATM cap. Whoever’s doing the transaction knows the rules and is staying under them.
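Since the two families mean different things, it can help to label each hit with the rule it matched so a reviewer can triage them separately. A small sketch; the label strings are invented for illustration:

```sql
-- Same filter as the amount query, plus a column naming which pattern fired.
SELECT cardholder_id, timestamp, amount, merchant_id,
       CASE
           WHEN amount IN (1.00, 5.00, 10.00)            THEN 'possible_card_test'
           WHEN amount >= 99.50  AND amount < 100.00     THEN 'just_under_100'
           WHEN amount >= 499.50 AND amount < 500.00     THEN 'just_under_500'
       END AS pattern
FROM transactions
WHERE (amount >= 99.50 AND amount < 100.00)
   OR (amount >= 499.50 AND amount < 500.00)
   OR amount IN (1.00, 5.00, 10.00)
ORDER BY cardholder_id, timestamp;
```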
4. Suspicious merchants
When a skimmer compromises a card reader at, say, a gas pump, you don’t get one fraud case. You get dozens. Every card swiped at that pump over the next few weeks ends up in someone’s database. So the symptom from the merchant side is: an unusual number of unrelated cards spending more than usual, in a short window.
SELECT merchant_id,
       date_trunc('hour', timestamp) AS hour_bucket,
       count(DISTINCT cardholder_id) AS unique_cards,
       count(*) AS total_tx,
       sum(amount) AS total_amount
FROM transactions
WHERE timestamp >= current_date - INTERVAL '7 days'
GROUP BY 1, 2
HAVING count(DISTINCT cardholder_id) > 20 AND sum(amount) > 5000
ORDER BY total_amount DESC;
The problem with static thresholds (20 unique cards, $5,000) is that they don’t account for size. A Costco does that in 90 seconds. A used bookshop, never. So the better version compares each merchant against itself:
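One way that self-comparison could look is a z-score against the merchant's own hourly history. This is a sketch under stated assumptions, not a finished rule: the 90-day lookback, the minimum of 30 historical hours, and the cutoff of 3 standard deviations are all placeholder values.

```sql
-- Compare each merchant's recent hours against its own earlier hourly history.
-- Lookback (90 days), minimum history (30 hours) and z cutoff (3) are assumptions.
WITH hourly AS (
    SELECT merchant_id,
           date_trunc('hour', timestamp) AS hour_bucket,
           count(DISTINCT cardholder_id) AS unique_cards,
           sum(amount) AS total_amount
    FROM transactions
    WHERE timestamp >= current_date - INTERVAL '90 days'
    GROUP BY 1, 2
),
baseline AS (
    -- History excludes the last 7 days so a current spike cannot inflate its own baseline
    SELECT merchant_id,
           avg(unique_cards) AS avg_cards,
           stddev_samp(unique_cards) AS sd_cards,
           avg(total_amount) AS avg_amount,
           stddev_samp(total_amount) AS sd_amount
    FROM hourly
    WHERE hour_bucket < current_date - INTERVAL '7 days'
    GROUP BY 1
    HAVING count(*) >= 30
)
SELECT h.merchant_id, h.hour_bucket, h.unique_cards, h.total_amount,
       (h.unique_cards - b.avg_cards) / nullif(b.sd_cards, 0) AS cards_z,
       (h.total_amount - b.avg_amount) / nullif(b.sd_amount, 0) AS amount_z
FROM hourly h
JOIN baseline b USING (merchant_id)
WHERE h.hour_bucket >= current_date - INTERVAL '7 days'
  AND (h.unique_cards - b.avg_cards) / nullif(b.sd_cards, 0) > 3
  AND (h.total_amount - b.avg_amount) / nullif(b.sd_amount, 0) > 3
ORDER BY amount_z DESC;
```

Hours with no transactions never appear in hourly, so the baseline describes a merchant's active hours only. Merchants with fewer than 30 historical hours are skipped entirely, and new merchants need a different check.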