Which Regularizer Should You Actually Use? Lessons from 134,400 Simulations
A practitioner’s decision framework for Ridge, Lasso, and ElasticNet based on three quantities you can compute before fitting a model.
Ridge, Lasso, or ElasticNet? We ran 134,400 simulations grounded in real production ML models to find out. The answer depends on what you’re optimizing for, and on a single diagnostic you can compute before fitting a model.
If you’ve ever trained a linear model in scikit-learn, you’ve faced this question: RidgeCV, LassoCV, or ElasticNetCV? Maybe you defaulted to whatever a tutorial recommended. Maybe a colleague had a strong opinion. Maybe you tried all three and picked whichever gave the best cross-validation score.
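The "try all three" approach looks like this in practice. A minimal sketch using scikit-learn's built-in cross-validated estimators on synthetic data; the data sizes, alpha grid, and l1_ratio values here are illustrative, not the ones from our simulations:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic regression problem (sizes are illustrative)
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each estimator selects its own regularization strength by cross-validation
models = {
    "Ridge": RidgeCV(alphas=np.logspace(-3, 3, 25)),
    "Lasso": LassoCV(cv=5, random_state=0),
    "ElasticNet": ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name}: test RMSE = {rmse:.2f}")
```

The catch, as the simulations below show, is that the winner of this bake-off often differs by a margin too small to matter, while the runtime cost of finding it does not.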
We wanted to replace intuition with empirical decision-making. We ran 134,400 simulations across 960 configurations of a 7-dimensional parameter space, varying sample size, features, multicollinearity, signal-to-noise ratio, coefficient sparsity, and two more parameters.
We benchmarked four regularization frameworks (Ridge, Lasso, ElasticNet, and Post-Lasso OLS) across three objectives: predictive accuracy (test RMSE), variable selection (F1 score for recovering the true feature set), and coefficient estimation (L2 error vs. true coefficients).
Our simulation ranges aren’t arbitrary. They’re grounded in eight real-world production ML models from Instacart, spanning demand forecasting, conversion prediction, and inventory intelligence. The regimes we tested reflect conditions that MLEs actually encounter in practice.
The Headlines
Before we get into the details:
- For prediction, it barely matters. Ridge, Lasso, and ElasticNet differ by at most 0.3% in median RMSE. No hyperparameter achieves even a small effect size for RMSE differences among them. This only holds with adequate training data (> 78 observations per feature).
- For variable selection, it matters enormously, especially under multicollinearity. Lasso’s recall collapses to 0.18 under high condition numbers with low signal, while ElasticNet maintains 0.93.
- At large sample-to-feature ratios (n/p ≥ 78), the methods become interchangeable. Use Ridge; it’s the fastest.
- Post-Lasso OLS should be avoided when optimizing for RMSE. It’s the only method that consistently underperforms, and it does so on every objective we measured.
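Both diagnostics behind these headlines, the sample-to-feature ratio n/p and the condition number of the design matrix, can be computed before any model is fit. A sketch, assuming standardized features; the n/p threshold of 78 comes from our simulations, while the condition-number cutoff of 100 is an illustrative placeholder, not a value the simulations pin down:

```python
import numpy as np

def regularizer_diagnostics(X, np_threshold=78.0, cond_cutoff=100.0):
    """Pre-fit diagnostics: n/p ratio, condition number, and a starting hint."""
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize features first
    cond = np.linalg.cond(Xs)                  # multicollinearity severity
    ratio = n / p
    if ratio >= np_threshold:
        hint = "Ridge (methods interchangeable; Ridge is fastest)"
    elif cond > cond_cutoff:
        hint = "ElasticNet (Lasso's selection degrades under collinearity)"
    else:
        hint = "Lasso or ElasticNet (selection matters at low n/p)"
    return ratio, cond, hint

# Example: 1000 rows, 12 features -> n/p ≈ 83, above the 78 threshold
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
print(regularizer_diagnostics(X))
```

The function name and the decision cutoffs are ours for illustration; the point is that this triage costs one `np.linalg.cond` call, not a three-way cross-validation bake-off.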
Finding 1: For Prediction, Just Use Ridge
This is the most important finding for the largest number of practitioners. Ridge, Lasso, and ElasticNet are nearly interchangeable for prediction. Across all 33,600 simulations per method, the median test RMSE differs by at most 0.3%.
So why Ridge? Computational efficiency. Ridge has a closed-form solution for each candidate α, making it dramatically faster than the alternatives (compare Ridge’s median runtime of 6 seconds to Lasso’s median runtime of 9 seconds and ElasticNet’s median runtime of 48 seconds).
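The closed-form solution is β̂ = (XᵀX + αI)⁻¹Xᵀy, a single linear solve per candidate α rather than an iterative coordinate-descent loop. A minimal sketch verifying it against scikit-learn's `Ridge` (with `fit_intercept=False` so the two objectives match exactly):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.0, 3.0, 0.5]) + rng.normal(scale=0.1, size=200)

alpha = 1.0
# Closed form: solve (X'X + alpha*I) beta = X'y directly
beta_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# scikit-learn's Ridge minimizes the same penalized least-squares objective
model = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
print(np.allclose(beta_closed, model.coef_, atol=1e-6))  # → True
```

Lasso and ElasticNet have no such closed form; their L1 term forces an iterative solver, which is where the runtime gap comes from.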
ElasticNet’s overhead stems from its joint grid search over α and the L1 ratio ρ. The 167–219× mean overhead we measured is specific to our 8-value L1 ratio grid. A coarser 3-value grid would reduce this proportionally. Even worse, when the coefficient distribution is approximately uniform, Lasso can take over an hour to converge. This overhead buys you a median RMSE improvement of just 0.04% over Ridge, a margin that’s negligible in practice.
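If you do reach for ElasticNet, the grid trade-off is one parameter away. A sketch comparing an 8-value `l1_ratio` grid (matching the grid size our overhead figures are based on; the specific values here are illustrative) against a coarse 3-value grid:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=400, n_features=30, noise=5.0, random_state=1)

# 8-value grid: the search space our 167-219x overhead figure refers to
fine = ElasticNetCV(l1_ratio=np.linspace(0.1, 1.0, 8).tolist(), cv=5,
                    random_state=1).fit(X, y)

# 3-value grid: roughly 8/3 fewer (alpha, rho) combinations to cross-validate
coarse = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5,
                      random_state=1).fit(X, y)

print(f"fine grid picks l1_ratio={fine.l1_ratio_}, "
      f"coarse grid picks l1_ratio={coarse.l1_ratio_}")
```

Given the 0.04% median RMSE margin over Ridge, the coarse grid (or skipping ElasticNet entirely when n/p is large) is usually the right call for prediction-only workloads.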