I Pitted XGBoost Against Logistic Regression on 358 Matches. The Boring Model Won.
There’s a reflex most of us share on a new modelling problem: reach for the model that wins. These days that’s gradient boosting, and the reflex is usually right; XGBoost earns its reputation on a staggering range of problems. So when I lined up five classifiers on the same task and the one-line linear model beat the Kaggle champion, the result was the kind that surprises exactly nobody who has shipped models on real data, and almost everybody still learning.
Five classifiers, same task, same features: predict whether an international match ends in a home win, draw, or away win. The contenders ran from a humble logistic regression up through a random forest, KNN, a small neural network, and XGBoost. The simplest one won. Why it won is more interesting than the fact that it did, and the why is one of the most useful ideas in applied machine learning. Here’s the experiment, the result, and the theory that cracks it open.
The setup
This came out of building a suite of eleven World Cup models, where I needed a result classifier and wanted to know which family to trust. Each model saw the same three features for 358 historical internationals (the 2010–2022 World Cups plus the 2020 and 2024 Euros): the strength gap between the teams, their combined strength, and a knockout flag. The target is the three-way result.
I scored them with 5-fold cross-validation, and the primary metric is log-loss, not accuracy. That choice does a lot of work in this article, so it’s worth being explicit about it up front. Accuracy only asks whether the top-ranked class was correct. Log-loss grades the entire probability vector and punishes confident mistakes hard.
For a forecasting model whose entire job is to emit calibrated probabilities, log-loss is the honest scorecard and accuracy is a sanity check. The number to keep in your pocket is ln(3) ≈ 1.099, the log-loss you’d get by shrugging and predicting a uniform 1/3 across the three classes. Beat 1.099 and your model knows something. Score above it and you’d have been better off guessing.
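To make the scorecard concrete, here is a minimal sketch of log-loss in plain NumPy. The class encoding and the probability vectors are made up for illustration; they are not taken from the match data.

```python
import numpy as np

def log_loss(y_true, probs):
    """Mean negative log-probability assigned to the class that actually occurred."""
    probs = np.asarray(probs, dtype=float)
    y_true = np.asarray(y_true)
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(picked)))

# Hypothetical encoding: 0 = home win, 1 = draw, 2 = away win.
y_true = np.array([0, 1, 2, 0])

# Predicting 1/3 everywhere scores ln(3), whatever the outcomes are.
uniform = np.full((4, 3), 1 / 3)
baseline = log_loss(y_true, uniform)

# Two forecasts that both pick "home win" for a match that ended in a draw.
# Accuracy marks both as equally wrong; log-loss does not.
cautious = log_loss([1], [[0.45, 0.30, 0.25]])
confident = log_loss([1], [[0.90, 0.05, 0.05]])

print(f"uniform baseline: {baseline:.3f}")
print(f"cautious miss:    {cautious:.3f}")
print(f"confident miss:   {confident:.3f}")
```

The confident miss costs far more than the cautious one, which is the sense in which log-loss punishes confident mistakes.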
The result
There are two things in the results below that should bother you. The first is the podium: a plain logistic regression posted the best log-loss, and XGBoost, the model that wins Kaggle competitions, came last. The second is stranger and easy to skim past. XGBoost didn’t just lose; it scored above 1.099, the uniform-guessing baseline. A model with a respectable-looking 48% accuracy was, by the metric that actually matters here, worse than a coin with three sides.
(Table omitted for brevity)
Both of these facts have the same root cause, and it’s the most useful idea in this whole article.
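A 48% accuracy sitting alongside a log-loss above 1.099 is less paradoxical than it sounds. The sketch below uses a stylised forecaster, which is an assumption for illustration and not XGBoost’s actual output: it puts a fixed probability on its top pick, spreads the rest evenly over the other two classes, and its pick is right 48% of the time.

```python
import math

def expected_log_loss(accuracy, top_prob, n_classes=3):
    """Expected log-loss of a stylised forecaster that puts `top_prob` on its
    pick, splits the remainder evenly, and is right `accuracy` of the time."""
    other_prob = (1 - top_prob) / (n_classes - 1)
    return -(accuracy * math.log(top_prob) + (1 - accuracy) * math.log(other_prob))

uniform_baseline = math.log(3)

# Same 48% hit rate, two different levels of confidence (hypothetical values).
honest = expected_log_loss(accuracy=0.48, top_prob=0.48)
overconfident = expected_log_loss(accuracy=0.48, top_prob=0.80)

print(f"uniform baseline: {uniform_baseline:.3f}")
print(f"honest 48%:       {honest:.3f}")
print(f"overconfident:    {overconfident:.3f}")
```

With identical accuracy, the forecaster whose confidence matches its hit rate lands below the baseline and the overconfident one lands above it. Accuracy cannot see the difference; log-loss can.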
Why the boring model won: bias and variance
The clean way to think about this is the bias–variance decomposition. A model’s expected out-of-sample error splits into three parts: Error = Bias² + Variance + Irreducible noise.
Bias is error from wrong assumptions: too rigid a model misses real structure in the data. Variance is error from sensitivity to the particular training sample: too flexible a model fits noise that won’t recur next time. Irreducible noise is the genuine randomness of the thing you’re predicting. In football it’s enormous: a single deflected shot decides a knockout tie. No model touches this term, which is why even the best classifier here sits near 50% accuracy.
The whole game is the trade between the first two. High-capacity models, such as boosted trees or neural nets, buy low bias by being flexible enough to bend to almost any shape in the data. The bill for that flexibility is variance, and it only comes due when you don’t have enough data to pin the model down. And that’s exactly our situation.
With 358 examples split across a three-way target, you have an average of roughly 120 matches per class. An XGBoost ensemble, meanwhile, has thousands of effective parameters spread across its trees. There simply isn’t enough signal to discipline all of them, so they latch onto quirks that happen to appear in one cross-validation fold and vanish in the next. That’s textbook overfitting.
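Written out for squared-error loss, where the identity holds exactly, the decomposition looks like this. The notation is an assumption for illustration: f is the true function, \hat{f} is the model fitted on a random training sample, y = f(x) + ε with zero-mean noise of variance σ², and expectations run over training samples and noise. Log-loss does not split into exactly these three terms, so treat the formula as intuition for what follows.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible noise}}
```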
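The effect is easy to reproduce in miniature. The sketch below is not the original experiment: the data is synthetic, the feature names are hypothetical stand-ins, and scikit-learn’s `GradientBoostingClassifier` is swapped in for XGBoost so the example needs only one library. It scores a linear model and a boosted ensemble with 5-fold cross-validated log-loss on 358 simulated matches.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_matches = 358  # same size as the article's dataset, but the data is synthetic

# Hypothetical stand-ins for the three features.
strength_gap = rng.normal(size=n_matches)
combined_strength = rng.normal(size=n_matches)
knockout = rng.integers(0, 2, size=n_matches)
X = np.column_stack([strength_gap, combined_strength, knockout])

# Simulated outcomes: only the strength gap carries signal, through a softmax.
logits = np.column_stack(
    [0.8 * strength_gap, np.zeros(n_matches), -0.8 * strength_gap]
)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
y = np.array([rng.choice(3, p=p) for p in probs])  # 0 = home, 1 = draw, 2 = away

models = {
    "logistic": LogisticRegression(max_iter=1000),
    # Stand-in for XGBoost; deliberately left untuned.
    "boosted_trees": GradientBoostingClassifier(
        n_estimators=300, max_depth=3, random_state=0
    ),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for name, model in models.items():
    # scikit-learn reports negated log-loss so that higher is better; flip the sign.
    neg_ll = cross_val_score(model, X, y, scoring="neg_log_loss", cv=cv)
    scores[name] = float(-neg_ll.mean())

print(f"uniform baseline: {np.log(3):.3f}")
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

On data this small and this noisy I would expect the untuned ensemble to score worse than the linear model, but the numbers it prints say nothing about the real matches. The sketch only shows the mechanism.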