$ECUAS_n$: A family of metrics for principled evaluation of uncertainty-augmented systems
Abstract: In high-stakes automated decision-making, access to predictive uncertainty is essential for enabling users — human or downstream systems — to accept or reject predictions based on application-specific cost trade-offs.
Such uncertainty-augmented (UA) systems, i.e., systems that output both predictions and uncertainty scores, are currently assessed in the literature in several ways: using separate metrics for the predictions and the uncertainty scores, setting a cost function with a fixed rejection cost, or integrating over a coverage-risk curve.
We argue that these evaluation approaches are inadequate for assessing the overall performance of a UA system in decision making under uncertainty, and we propose a novel family of metrics, $ECUAS_n$, formulated as proper scoring rules for the task of interest.
The parameter $n$ controls the trade-off between the cost of incorrect predictions and the cost of imperfect uncertainties, and can be set according to the needs of the use case.
We demonstrate the advantages of the $ECUAS_n$ metrics both theoretically and empirically, through experiments on diverse classification and generation datasets, including a manually annotated subset of TriviaQA.
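For orientation, two of the existing evaluation approaches named in the abstract, a cost function with a fixed rejection cost and the area under the coverage-risk curve, can be sketched as follows. This is a minimal illustration under stated assumptions, not the $ECUAS_n$ metric, whose definition is not given in this section. The function names, the 0/1 error encoding, and the convention that a larger score means more uncertainty are all hypothetical choices made for the example.

```python
from typing import List, Sequence, Tuple


def coverage_risk_curve(
    errors: Sequence[float], uncertainties: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Coverage and risk when accepting the k least-uncertain predictions, k = 1..N.

    Assumption: errors[i] is 1 for a wrong prediction and 0 for a correct one,
    and a larger uncertainty score means the system is less sure.
    """
    order = sorted(range(len(errors)), key=lambda i: uncertainties[i])
    coverages, risks = [], []
    cumulative_error = 0.0
    for k, i in enumerate(order, start=1):
        cumulative_error += errors[i]
        coverages.append(k / len(errors))   # fraction of predictions accepted
        risks.append(cumulative_error / k)  # error rate among accepted predictions
    return coverages, risks


def area_under_coverage_risk(
    errors: Sequence[float], uncertainties: Sequence[float]
) -> float:
    """Average risk over all coverage levels (a simple discrete integral)."""
    _, risks = coverage_risk_curve(errors, uncertainties)
    return sum(risks) / len(risks)


def fixed_rejection_cost(
    errors: Sequence[float],
    uncertainties: Sequence[float],
    threshold: float,
    reject_cost: float,
) -> float:
    """Average cost when predictions above `threshold` are rejected at a fixed cost.

    Accepted predictions pay their error (0 or 1); rejected ones pay `reject_cost`.
    Both `threshold` and `reject_cost` are illustrative parameters.
    """
    total = 0.0
    for error, uncertainty in zip(errors, uncertainties):
        total += reject_cost if uncertainty > threshold else error
    return total / len(errors)


if __name__ == "__main__":
    errors = [0, 0, 1, 1]
    informative = [0.1, 0.2, 0.8, 0.9]  # wrong predictions get high uncertainty
    misleading = [0.9, 0.8, 0.2, 0.1]   # wrong predictions get low uncertainty
    print(area_under_coverage_risk(errors, informative))
    print(area_under_coverage_risk(errors, misleading))
    print(fixed_rejection_cost(errors, informative, threshold=0.5, reject_cost=0.3))
```

Both quantities are lower when the uncertainty scores rank wrong predictions as more uncertain. Each also depends on a particular choice (the rejection cost and threshold, or uniform weighting over coverage levels), which is the kind of evaluation design the abstract calls into question.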