SemiScope: Disentangling Classifier Tuning and Joint Optimization in Semi-Supervised Security Classification
Abstract:
Background. Labeled data for security classification is scarce. Semi-supervised learning (SSL) propagates labels from a small labeled pool to a larger unlabeled pool. Yet security applications often treat SSL as a black box: default parameters, a fixed classifier, and no handling of the class imbalance that pseudo-labels induce.
Aims. Recent work reports sizeable gains from optimizing SSL pipelines via joint search, AutoML, or per-component tuning. These gains are hard to attribute: they may reflect useful SSL-classifier interactions, or they may come mostly from simply tuning the downstream classifier. We disentangle these effects for binary tabular security data with classical SSL and tree-based classifiers.
Method. We build SemiScope as an analysis instrument, not a deployment recommendation. It uses Bayesian Optimization to jointly tune SSL settings, confidence filtering, oversampling, and the classifier. The key control, Tuned-Clf, fixes SSL to its defaults but receives the same 100-trial classifier budget and validation-set threshold tuning as SemiScope. At a 10% label rate, we compare the two with a paired TOST (two one-sided tests) equivalence procedure, using ±1.0 g-measure points as the smallest effect size of interest.
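The equivalence comparison can be illustrated with a minimal sketch of a paired TOST. The abstract does not say which underlying test is used, so the t-test form, the significance level `alpha=0.05`, and the function name `paired_tost` are assumptions made for illustration; the inputs are per-run g-measure scores for the two pipelines on matched runs.

```python
import numpy as np
from scipy import stats


def paired_tost(scores_a, scores_b, margin=1.0, alpha=0.05):
    """Paired TOST on matched scores (hypothetical helper, t-test based).

    Declares equivalence when the mean paired difference is shown to lie
    strictly inside (-margin, +margin) at level alpha.
    """
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    # H0: mean(diff) <= -margin  vs  H1: mean(diff) > -margin
    p_lower = stats.ttest_1samp(diff, -margin, alternative="greater").pvalue
    # H0: mean(diff) >= +margin  vs  H1: mean(diff) < +margin
    p_upper = stats.ttest_1samp(diff, margin, alternative="less").pvalue
    # Both one-sided nulls must be rejected, so report the larger p-value.
    p_value = float(max(p_lower, p_upper))
    return p_value, bool(p_value < alpha)
```

A non-significant result here does not show that the pipelines differ; it only means equivalence within the margin could not be established, which is how an "inconclusive" outcome arises.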
Results. SemiScope beats every default SSL baseline on all five datasets, improving over the strongest baseline by 0.7-12.7 points. Under the equal-budget control, Tuned-Clf is statistically equivalent to the full pipeline on 4 of 5 datasets; the result on Phishing is inconclusive. Classifier hyperparameter optimization (HPO) alone recovers a median of 86% of SemiScope's gain over Default Self-Training (ST) + Random Forest (RF).
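One plausible reading of the recovered-gain figure, stated here as an assumption because the abstract does not give the formula, is the ratio of the control's improvement to the full pipeline's improvement over the same default baseline, computed per dataset and then summarized by the median. The symbols below are illustrative, with each $g$ denoting a g-measure score on one dataset:

```latex
\[
\text{recovered gain} \;=\;
\frac{g_{\text{Tuned-Clf}} - g_{\text{Default ST+RF}}}
     {g_{\text{SemiScope}} - g_{\text{Default ST+RF}}}
\times 100\%
\]
```

Under this reading, a value near 100% on a dataset means that tuning the classifier alone accounts for almost all of the improvement attributed to the full joint search.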
Conclusions. The reusable contribution is the decomposition protocol. A simpler recipe suffices: use Self-Training, tune the classifier with Bayesian Optimization, and tune the decision threshold on validation data. This recipe comes within 1 g-measure point of Supervised RF at 20-30% labels on four datasets and at 40% on Drebin, which is the same or a lower label rate than Default ST + RF needs on every dataset.
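The threshold-tuning step of the recipe can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes g-measure is the harmonic mean of recall and specificity on a 0-100 scale (the abstract does not define it), and the helper names `g_measure` and `tune_threshold` and the threshold grid are hypothetical.

```python
import numpy as np


def g_measure(y_true, y_pred):
    """Harmonic mean of recall and specificity, scaled to 0-100 (assumed definition)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if recall + specificity == 0:
        return 0.0
    return float(100.0 * 2 * recall * specificity / (recall + specificity))


def tune_threshold(y_val, proba_val, grid=None):
    """Pick the decision threshold that maximizes g-measure on validation data."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # illustrative grid, not from the paper
    proba_val = np.asarray(proba_val, dtype=float)
    scores = [g_measure(y_val, proba_val >= t) for t in grid]
    best = int(np.argmax(scores))  # the first maximum wins on ties
    return float(grid[best]), float(scores[best])
```

In use, `proba_val` would be the positive-class probabilities that the tuned classifier produces on a held-out validation split, and the selected threshold would then be applied unchanged to the test set.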