Few-Shot Resampling for Scalable Statistically-Sound Data Mining
Abstract: A key step in knowledge discovery is the evaluation of data mining results. In several applications, such as pattern mining and graph analysis, this step includes assessing the statistical significance of the results, to avoid spurious discoveries that are due only to noise or random fluctuations in the data.
While specialized procedures have been developed for some specific applications, resampling-based approaches are widely used, in particular for complex analyses where analytical results cannot be derived. However, current resampling-based approaches require the generation and analysis of thousands of resampled datasets, and are therefore impractical for large datasets or computationally intensive analyses.
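To make the cost of conventional resampling concrete, the following is a minimal sketch of a generic Monte Carlo permutation test. It is not FewRS, whose bound is not given in this abstract; it is only the kind of baseline procedure the paragraph above describes. The function name `empirical_p_value`, the difference-of-means test statistic, and the default of 1000 resamples are illustrative assumptions.

```python
import random
from statistics import mean


def empirical_p_value(group_a, group_b, num_resamples=1000, seed=0):
    """Generic permutation test (illustrative baseline, NOT FewRS).

    The test statistic is the absolute difference of group means; this
    choice is an assumption made only for the sake of the example.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))

    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(num_resamples):
        # Each iteration builds one resampled dataset by shuffling the
        # group labels, then recomputes the statistic on it.
        rng.shuffle(pooled)
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            at_least_as_extreme += 1

    # The +1 terms count the observed dataset itself, so the estimate
    # can never be exactly zero.
    return (1 + at_least_as_extreme) / (1 + num_resamples)


if __name__ == "__main__":
    a = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.4, 9.7]
    b = [0.2, -0.1, 0.0, 0.3, -0.2, 0.1, 0.4, -0.3]
    print(empirical_p_value(a, b))
```

The smallest p-value this estimator can return is 1 / (1 + num_resamples), and every resample repeats the full analysis. This is why procedures of this kind typically need thousands of resampled datasets once many hypotheses are tested and small significance thresholds are required.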
In this paper, we introduce FewRS, a simple and effective resampling-based approach to assess the statistical significance of data mining results with rigorous guarantees on the probability of false discoveries. Our approach can be used in every situation where resampling-based approaches are applied.
FewRS builds on our derivation of a novel bound on the supremum deviation of the test statistics that represent the quality of data mining results. We prove that FewRS needs to generate and analyze only an extremely small number of resampled datasets, leading to a highly scalable approach with wide applicability.
We test our approach on common tasks such as pattern mining and network analysis. In all cases, our approach reduces running time by up to two orders of magnitude compared to the state of the art while preserving high statistical power, enabling the statistical validation of data mining results on large-scale real-world datasets.