The Hot Path Belongs to GBDTs, Agents Own the Cold Path: A Payment-Fraud Benchmark
A reproducible benchmark on latency, cost, and reproducibility, and where agents actually earn their keep.
This article tests a question that keeps coming up in payments work: can an LLM agent replace a gradient-boosted scorer on the synchronous payment authorization path? The question has a reasonable shape. Agents are handling investigation queues that used to need a senior analyst and five dashboards, so it sounds like a natural fit for scoring the transaction too.
I built a small benchmark to answer it. The benchmark runs on a laptop. It needs no GPU, no API key, and no cloud account. The source code is on GitHub at github.com/sandeepmb/fraud-agents-benchmark. Every figure and number in this article comes out of the same Python repo, so you can rerun it and check the work.
The short answer is that classical ML still owns the synchronous hot path, and agents belong in the asynchronous cold path. The rest of this article explains the three measurements that draw the line between those two layers, and the hybrid architecture I ended up recommending.
TL;DR
- On a single CPU core, the gradient-boosted scorer hits p99 latency of 0.15 ms. A calibrated LLM-latency simulator (not a live API) puts an LLM scorer at p99 around 1,200 ms. The ISO 8583 authorization budget is roughly 100 ms.
- At 50,000 transactions per second for one hour, the GBDT scorer costs about $54. A gpt-4o-mini-class model costs $16,200. A frontier model (Claude Sonnet 4.6) costs $351,000. These figures assume bare scoring. Agentic reasoning multiplies them.
- On 500 calls with bit-identical input, the GBDT returns 1 distinct score. A non-deterministic LLM returns 498. Hosted LLM inference can stay non-deterministic even when temperature is set to 0, which makes a hot-path scorer hard to validate in a regulated authorization decision.
- Agents do useful work on the asynchronous cold path: SAR drafting, evidence gathering through MCP-typed tools, and an agent-as-a-judge pass before human sign-off.
Scope and Limits
Four honest boundaries before the results. This is not a claim that LLMs cannot help fraud teams. The second half of this article is about where they clearly do. It is also not a comparison against fine-tuned tabular transformers or deep-learning tabular models. The comparison is between a deterministic gradient-boosted scorer and LLM-style scoring in synchronous authorization.
The GBDT path is measured on a local CPU. The LLM latency path is simulated from a calibrated distribution, not measured against a live API. The cost figures are calculated from published per-token pricing. Determinism is shown two ways: for the GBDT it is measured locally, and for the LLM it is reproduced by the simulator and supported by external evidence.
| Component | Measured, simulated, or calculated | Why |
|---|---|---|
| GBDT latency | Measured | Local single-core CPU benchmark |
| LLM latency | Simulated | Calibrated log-normal, no API or GPU dependency |
| Cost | Calculated | Published May-2026 per-token pricing |
| Determinism | Measured (GBDT) and cited evidence (LLM) | Local benchmark plus |
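The "Calculated" row is plain arithmetic and can be sketched in a few lines. The per-token prices and the 400-input / 50-output token counts below are assumptions, not values taken from the repo; they were chosen because together they reproduce the $16,200 and $351,000 figures in the TL;DR. The name `hourly_llm_cost` is illustrative as well.

```python
# Sketch of the cost arithmetic for one hour at 50,000 TPS.
# ASSUMPTIONS (not from the repo): token counts per call and the
# per-million-token prices, picked because they reproduce the article's figures.
TPS = 50_000
SECONDS_PER_HOUR = 3_600
INPUT_TOKENS = 400    # assumed prompt size per transaction
OUTPUT_TOKENS = 50    # assumed completion size per transaction

# model label: (input $/1M tokens, output $/1M tokens), assumed prices
PRICES = {
    "gpt-4o-mini-class": (0.15, 0.60),
    "frontier (Sonnet-class)": (3.00, 15.00),
}

def hourly_llm_cost(price_in: float, price_out: float) -> float:
    """Dollar cost of bare scoring for one hour at TPS calls per second."""
    calls = TPS * SECONDS_PER_HOUR
    per_call = (INPUT_TOKENS * price_in + OUTPUT_TOKENS * price_out) / 1_000_000
    return calls * per_call

for name, (p_in, p_out) in PRICES.items():
    print(f"{name}: ${hourly_llm_cost(p_in, p_out):,.0f} per hour")
```

Under these assumptions the two results round to $16,200 and $351,000 per hour. An agentic loop that makes several calls per transaction, or sends longer prompts, scales the same formula up linearly.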
The Setup
I wanted a benchmark anyone could rerun without an A100 or an OpenAI API key. That meant three design choices. The data is synthetic and ISO 8583-shaped. Each transaction has twenty features, the kinds of fields a card-not-present hot-path scorer actually sees: amount, MCC risk, device age, geo-distance, velocity counters at one-hour and twenty-four-hour windows, chargeback history, and a handful of binary flags.
Fraud rate is 1.5%. The generator includes a stealth-fraud rate parameter so that about 15% of fraud rows are drawn from the legit-class distribution. This mirrors sophisticated mimicry and gives the benchmark an irreducible Bayes-optimal error floor. Without it, a tree ensemble lands at PR-AUC around 0.999, which would make the whole exercise look fake.
# src/fraud_benchmark/data.py (abridged)
import numpy as np

def generate(n_rows, fraud_rate=0.015, seed=42, stealth_rate=0.15):
    rng = np.random.default_rng(seed)
    # Split the rows into legit, overt fraud, and stealth fraud.
    n_fraud = int(round(n_rows * fraud_rate))
    n_stealth = int(round(n_fraud * stealth_rate))
    legit = _draw_class(rng, n_rows - n_fraud, is_fraud=False)
    overt = _draw_class(rng, n_fraud - n_stealth, is_fraud=True)
    # Stealth rows are fraud rows whose features come from the legit distribution.
    stealth = _draw_class(rng, n_stealth, is_fraud=False)  # mimicry ...
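To see how the split works out at the training-set size used later, the row counts can be computed on their own. This is a minimal sketch: `split_counts` is a hypothetical helper, not a function in the repo, and it only mirrors the rounding in `generate` without drawing any data.

```python
def split_counts(n_rows, fraud_rate=0.015, stealth_rate=0.15):
    """Return (legit, overt_fraud, stealth_fraud) row counts.

    Hypothetical helper: mirrors the rounding in generate(), draws no data.
    """
    n_fraud = int(round(n_rows * fraud_rate))
    n_stealth = int(round(n_fraud * stealth_rate))
    return n_rows - n_fraud, n_fraud - n_stealth, n_stealth

legit, overt, stealth = split_counts(200_000)
print(f"legit={legit}, overt fraud={overt}, stealth fraud={stealth}")
```

For 200,000 rows this gives 197,000 legit rows, 2,550 overt fraud rows, and 450 stealth rows. The stealth rows carry a fraud label but legit-looking features, which is where the error floor comes from.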
After training a HistGradientBoostingClassifier on 200,000 rows of this distribution, the model lands at PR-AUC 0.847 and ROC-AUC 0.931 on a 50,000-row holdout. Those are credible numbers for a production card-not-present scorer.
The scorer itself uses a fast batch=1 path. Calling sklearn’s predict_proba on a single row takes around 14 ms on this laptop, dominated by Python validation overhead. That number is unrepresentative of XGBoost or LightGBM in production, so for a fair comparison I extracted the trained model’s internal trees into per-field numpy arrays and wrote a tight traversal. It matches sklearn to float64 precision and runs about 100 times faster.
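The traversal idea can be sketched without touching any sklearn internals. The example below is an illustration under stated assumptions, not the repo's code: the field names (`feature`, `threshold`, `left`, `right`, `value`), the leaf marker, and the two tiny hand-built trees are all hypothetical, and plain Python lists stand in for the numpy arrays so the sketch has no dependencies.

```python
import math

# One tree stored as parallel per-field arrays, indexed by node id.
# Hypothetical layout: a leaf is marked with feature == -1.
TREE_A = {
    "feature":   [0,    -1,   1,    -1,   -1],
    "threshold": [50.0, 0.0,  0.5,  0.0,  0.0],
    "left":      [1,    -1,   3,    -1,   -1],
    "right":     [2,    -1,   4,    -1,   -1],
    "value":     [0.0,  -1.2, 0.0,  0.3,  1.5],
}
TREE_B = {
    "feature":   [1,    -1,   -1],
    "threshold": [0.8,  0.0,  0.0],
    "left":      [1,    -1,   -1],
    "right":     [2,    -1,   -1],
    "value":     [0.0,  -0.4, 0.9],
}

def leaf_value(tree, row):
    """Walk one tree from the root to a leaf and return the leaf's raw value."""
    feature, threshold = tree["feature"], tree["threshold"]
    left, right, value = tree["left"], tree["right"], tree["value"]
    node = 0
    while feature[node] != -1:
        # Go left when the feature is at or below the split threshold.
        node = left[node] if row[feature[node]] <= threshold[node] else right[node]
    return value[node]

def score(trees, row, baseline=0.0):
    """Sum leaf values across trees, then map the raw score to a probability."""
    raw = baseline + sum(leaf_value(t, row) for t in trees)
    return 1.0 / (1.0 + math.exp(-raw))

row = [120.0, 0.9]  # made-up values for two features, e.g. amount and MCC risk
print(f"fraud probability: {score([TREE_A, TREE_B], row):.4f}")
```

The inner loop is a handful of index lookups and one comparison per level, with no input validation or object construction per call. That is the general reason a flat-array traversal can be much cheaper than a full `predict_proba` call on a single row.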
The LLM scorer is simulated. This is the only place where running everything on a laptop required calibration rather than measurement. The simulator samples per-call latency from a log-normal distribution with a 540 ms median and σ = 0.35. The calibration draws on three public sources: NVIDIA Triton’s published time-to-first-token figures for Llama-3-8B q4 on an A10, vLLM benchmarks for Qwen2.5-7B on an RTX 4090, and the p50 and p99 numbers OpenAI and Anthropic publish for their hosted APIs. The simulator also produces non-deterministic score outputs on identical inputs, which is what we need for the determinism experiment.
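A minimal sketch of such a simulator, using only the standard library, looks like this. The 540 ms median and σ = 0.35 come from the text; the function names, the seed, the base score, and the way score jitter is generated are assumptions for illustration and are not taken from the repo.

```python
import math
import random
from statistics import NormalDist

MEDIAN_MS = 540.0          # from the article
SIGMA = 0.35               # from the article
MU = math.log(MEDIAN_MS)   # the median of a log-normal is exp(mu)

def analytic_quantile(q: float) -> float:
    """Closed-form quantile of the log-normal latency distribution, in ms."""
    return math.exp(MU + SIGMA * NormalDist().inv_cdf(q))

def simulate_llm_call(rng: random.Random, base_score: float = 0.73):
    """Return (latency_ms, score) for one simulated call on a fixed input.

    The Gaussian jitter is a stand-in for run-to-run variation in hosted
    inference; its size is an arbitrary assumption.
    """
    latency_ms = rng.lognormvariate(MU, SIGMA)
    score = min(1.0, max(0.0, base_score + rng.gauss(0.0, 0.02)))
    return latency_ms, score

rng = random.Random(42)
calls = [simulate_llm_call(rng) for _ in range(500)]
latencies = sorted(lat for lat, _ in calls)
distinct_scores = len({score for _, score in calls})

print(f"analytic p50 = {analytic_quantile(0.50):.0f} ms")
print(f"analytic p99 = {analytic_quantile(0.99):.0f} ms")
print(f"empirical median over 500 draws = {latencies[len(latencies) // 2]:.0f} ms")
print(f"distinct scores on identical input: {distinct_scores} of 500")
```

With these two parameters the closed-form p99 works out to roughly 1,219 ms, which is consistent with the "around 1,200 ms" figure in the TL;DR. The distinct-score count from this sketch depends on the assumed jitter and will not necessarily match the repo's simulator.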
With that setup, I ran three experiments.
Break #1: Inference Sits Outside the ISO 8583 Budget
The first experiment makes five thousand single-transaction calls to the GBDT scorer on one CPU core at batch size 1, and takes four hundred draws from the calibrated LLM latency distribution. The entire measured GBDT distribution sits to the left of the 100 ms ISO budget.
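The measurement loop behind an experiment like this can be sketched as follows. It is an illustration, not the repo's harness: `dummy_scorer` is a hypothetical stand-in for the GBDT fast path, the warm-up count is arbitrary, and the nearest-rank percentile is only one of several reasonable definitions.

```python
import math
import time

BUDGET_MS = 100.0  # authorization budget used in the article

def dummy_scorer(row):
    """Hypothetical stand-in for a batch=1 scorer."""
    return 1.0 / (1.0 + math.exp(-sum(row) / len(row)))

def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list, q in (0, 100]."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def measure_latency_ms(scorer, row, n_calls=5_000, warmup=100):
    """Time n_calls single-row invocations and return sorted latencies in ms."""
    for _ in range(warmup):  # warm caches before timing starts
        scorer(row)
    samples = []
    for _ in range(n_calls):
        start = time.perf_counter_ns()
        scorer(row)
        samples.append((time.perf_counter_ns() - start) / 1e6)
    return sorted(samples)

row = [0.1] * 20  # twenty features, values made up
latencies = measure_latency_ms(dummy_scorer, row)
p50, p99 = percentile(latencies, 50), percentile(latencies, 99)
print(f"p50 = {p50:.4f} ms, p99 = {p99:.4f} ms, within budget: {p99 < BUDGET_MS}")
```

Timing each call individually, rather than timing the whole loop and dividing, is what makes tail percentiles such as p99 available. The absolute numbers depend on the machine, so they are best read relative to the 100 ms budget.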