Uncertainty Estimates of Predictions via a General Bias-Variance Decomposition
A General Bias‑Variance Decomposition for Proper Scoring Rules – Finally! Or: Why your ensemble works, how to build confidence regions in logit space, and what Bregman information really does for uncertainty estimation.
If you’ve ever trained a classifier, you’ve heard the mantra: Bias‑variance trade‑off. But look closely – the classical decomposition works for squared error only. What about log‑loss? Brier score? CRPS? For years, we had no general, closed‑form bias‑variance decomposition for strictly proper scoring rules. Until now.
In their AISTATS 2023 paper, Gruber & Buettner (PDF) finally fill this gap. And they give us practical tools: Explain ensembles via a law of total Bregman variance. Build confidence regions directly in logit space. Detect out‑of‑distribution inputs better than raw softmax confidence. Let’s dive in.
The problem: Uncertainty under domain drift
Your model says “cat” with 0.99 probability – but the image is heavily corrupted. You know from Ovadia et al. (2019) that softmax confidence is not reliable under dataset shift. What we need is a variance‑based uncertainty measure that works for any proper loss. And we need a theory that explains why, for example, ensembling helps.
Missing piece: A general bias‑variance decomposition for strictly proper scoring rules.
Background: Bregman divergences & proper scoring rules
Bregman divergence Given a differentiable convex function $\phi$, the Bregman divergence is $d_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y), x-y \rangle$ – the gap between $\phi(x)$ and the first‑order Taylor approximation of $\phi$ around $y$.
- Example: $\phi(x)=x^2$ gives $d_\phi(x,y)=(x-y)^2$ (squared error).
- Example: $\phi(p)=\sum_i p_i \ln p_i$ (the negative Shannon entropy) gives the KL divergence $d_\phi(p, q) = \mathrm{KL}(p \,\|\, q)$.
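To make the definition concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) that evaluates $d_\phi$ from a convex generator and its gradient, and checks the two examples above:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman divergence d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# phi(x) = ||x||^2 recovers the squared error.
sq = lambda x: np.dot(x, x)
sq_grad = lambda x: 2 * x

# phi(p) = sum_i p_i ln p_i (negative Shannon entropy) recovers KL(p || q).
negent = lambda p: np.sum(p * np.log(p))
negent_grad = lambda p: np.log(p) + 1

x, y = np.array([0.2, 0.8]), np.array([0.5, 0.5])
print(bregman(sq, sq_grad, x, y))          # 0.18 == ||x - y||^2
print(bregman(negent, negent_grad, x, y))  # 0.1927... == KL(x || y)
```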
Strictly proper scoring rule A scoring rule $S(P, y)$ is strictly proper if the expected score $\mathbb{E}_{Y \sim Q}[S(P, Y)]$ is maximised only when $P$ equals the true data distribution $Q$.
Common examples:
- Log score: $S(P, y) = \log p(y)$
- Brier score: $S(P, y) = -\lVert \delta_y - P \rVert^2$, with $\delta_y$ the one‑hot vector of the outcome $y$
- CRPS (continuous ranked probability score)
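As a quick concrete check of the first two examples, here is a small sketch (hypothetical numbers, NumPy only) that scores a categorical prediction:

```python
import numpy as np

def log_score(p, y):
    """Log score S(P, y) = ln p(y): the log-probability assigned to the observed class."""
    return np.log(p[y])

def brier_score(p, y):
    """Brier score S(P, y) = -||delta_y - P||^2, with delta_y the one-hot outcome vector."""
    one_hot = np.zeros_like(p)
    one_hot[y] = 1.0
    return -np.sum((one_hot - p) ** 2)

p = np.array([0.7, 0.2, 0.1])  # predicted class probabilities (made up)
y = 0                          # observed class
print(log_score(p, y), brier_score(p, y))
```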
Every strictly proper scoring rule corresponds to a Bregman divergence generated by the negative entropy $G$ (Ovcharov, 2018).
The main result: A general bias‑variance decomposition
Let $\hat{f}$ be a random prediction (e.g., the same model trained on different training sets), and $Y \sim Q$ the true outcome. Let $S$ be a strictly proper scoring rule with negative entropy $G$, $G^*$ its convex conjugate, and $\nabla G$ the map from predictions to dual coordinates.
Theorem (Gruber & Buettner, 2023): $\mathbb{E}[-S(\hat{f}, Y)] = H(Q) + B_{G^*}[\nabla G(\hat{f})] + d_{G^*}(\nabla G(Q), \mathbb{E}[\nabla G(\hat{f})])$ – irreducible noise, a variance term, and a bias term.
What does each term mean?
- $B_{G^*}[X] = \mathbb{E}[d_{G^*}(X, \mathbb{E}[X])] = \mathbb{E}[G^*(X)] - G^*(\mathbb{E}[X])$ – the Bregman information (a generalised variance). For $\phi(x)=x^2$, $B_\phi[X] = \mathrm{Var}(X)$.
- $d_{G^*}(\nabla G(Q), \mathbb{E}[\nabla G(\hat{f})])$ – a Bregman divergence in the dual space between the true distribution and the mean prediction – that's the generalised squared bias.
So the classical MSE decomposition ($\text{error} = \text{noise} + \text{var} + \text{bias}^2$) is a special case of this theorem.
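For intuition, here is a quick Monte Carlo sanity check of the squared-error special case (all numbers are made up): with an independent random prediction, the expected loss splits into noise + variance + bias².

```python
import numpy as np

rng = np.random.default_rng(0)

# True outcome Y ~ N(1.0, 0.5^2); a biased, noisy prediction f_hat, independent of Y
# (think: the same model retrained on different training sets).
y = rng.normal(1.0, 0.5, size=1_000_000)
f_hat = rng.normal(1.3, 0.2, size=1_000_000)

expected_loss = np.mean((f_hat - y) ** 2)
noise = np.var(y)                              # H(Q): irreducible noise
variance = np.var(f_hat)                       # Bregman information of f_hat
bias_sq = (np.mean(f_hat) - np.mean(y)) ** 2   # dual-space Bregman divergence

print(expected_loss, noise + variance + bias_sq)  # agree up to Monte Carlo error
```

The theorem says the same three-way split holds for any strictly proper scoring rule, with the variance replaced by Bregman information and the squared distance replaced by a dual-space Bregman divergence.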
Special case: Classification (logit space) – this is huge
Let $\hat{z} \in \mathbb{R}^k$ be the logits (before softmax). Let $\text{sm}(z)$ be the softmax probabilities. Use the negative log‑likelihood (log loss) as the scoring rule.
Corollary: $\mathbb{E}[-\ln \text{sm}_Y(\hat{z})] = H(Q) + B_{\text{LSE}}[\hat{z}] + d_{\text{LSE}}(\text{sm}^{-1}(Q), \mathbb{E}[\hat{z}])$, where $\text{LSE}(x) = \ln \sum_i e^{x_i}$ (LogSumExp).
Why is this surprising? The variance term $B_{\text{LSE}}[\hat{z}]$ is computed directly on the logits, without applying softmax. No normalisation to probabilities needed – numerically stable and conceptually clean. This is perfect for deep neural networks: To estimate predictive uncertainty, just compute the Bregman information of the logits over an ensemble or multiple forward passes.
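A minimal sketch of how this could look for a deep ensemble (the array layout and function name are my own assumptions; logits stacked as members × classes, using SciPy's logsumexp):

```python
import numpy as np
from scipy.special import logsumexp

def bregman_information_lse(logits):
    """B_LSE for one input: the Jensen gap of LogSumExp over ensemble members.

    logits: array of shape (n_members, n_classes). Returns E[LSE(z)] - LSE(E[z]),
    which is zero iff all members produce identical softmax outputs and grows
    with disagreement, e.g., on corrupted or out-of-distribution inputs.
    """
    return logsumexp(logits, axis=1).mean() - logsumexp(logits.mean(axis=0))

# Hypothetical logits from a 5-member ensemble for a single 3-class input.
rng = np.random.default_rng(0)
logits = rng.normal(loc=[2.0, 0.0, -1.0], scale=0.8, size=(5, 3))
print(bregman_information_lse(logits))
```

The same quantity can be computed from multiple MC-dropout forward passes; by the corollary it is exactly the variance contribution to the expected log loss.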