$ECUAS_n$: A family of metrics for principled evaluation of uncertainty-augmented systems

Abstract: In high-stakes automated decision-making, access to predictive uncertainty is essential for enabling users — human or downstream systems — to accept or reject predictions based on application-specific cost trade-offs.

Such uncertainty-augmented (UA) systems, i.e., systems that output both predictions and uncertainty scores, are currently assessed in the literature in several ways: using separate metrics for the predictions and the uncertainty scores, setting a cost function with a fixed rejection cost, or integrating over a coverage-risk curve.
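
As a rough illustration of two of the existing evaluation approaches named above (a fixed rejection cost and integration over a coverage-risk curve), the following minimal sketch may help. It is not the $ECUAS_n$ metric. The function names, the 0/1 error cost, the thresholding rule, and the toy data are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def fixed_rejection_cost(
    errors: Sequence[int],
    uncertainties: Sequence[float],
    threshold: float,
    rejection_cost: float,
) -> float:
    """Average cost when predictions with uncertainty above `threshold` are rejected.

    Assumption: an accepted prediction costs 1 if wrong and 0 if right,
    and a rejected prediction always costs `rejection_cost`.
    """
    total = 0.0
    for err, unc in zip(errors, uncertainties):
        total += rejection_cost if unc > threshold else float(err)
    return total / len(errors)


def coverage_risk_curve(
    errors: Sequence[int], uncertainties: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return (coverage, risk) points, accepting samples from least to most uncertain."""
    order = sorted(range(len(errors)), key=lambda i: uncertainties[i])
    points = []
    wrong = 0
    for k, i in enumerate(order, start=1):
        wrong += errors[i]
        # Coverage is the fraction accepted; risk is the error rate among them.
        points.append((k / len(errors), wrong / k))
    return points


def area_under_risk_coverage(
    errors: Sequence[int], uncertainties: Sequence[float]
) -> float:
    """Mean risk over all coverage levels (a simple discrete integral)."""
    points = coverage_risk_curve(errors, uncertainties)
    return sum(risk for _, risk in points) / len(points)


# Toy data, made up for illustration: 1 marks a wrong prediction.
errors = [0, 0, 1, 0, 1]
uncertainties = [0.1, 0.2, 0.9, 0.3, 0.6]

cost = fixed_rejection_cost(errors, uncertainties, threshold=0.5, rejection_cost=0.25)
aurc = area_under_risk_coverage(errors, uncertainties)
print(f"fixed-rejection cost: {cost:.3f}, area under risk-coverage: {aurc:.3f}")
```

In this sketch the first score depends on a single threshold and a single rejection cost chosen up front, while the second averages over all coverage levels without reference to any application-specific cost.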

We argue that these evaluation approaches are inadequate for assessing the overall performance of a UA system for decision making under uncertainty, and we propose a novel family of metrics, $ECUAS_n$, formulated as proper scoring rules for the task of interest.

The parameter $n$ controls the trade-off between the cost of incorrect predictions and the cost of imperfect uncertainties, according to the needs of the use case.

We demonstrate the advantages of the $ECUAS_n$ metrics both theoretically and empirically, through experiments on diverse classification and generation datasets, including a manually annotated subset of TriviaQA.
