SemiScope: Disentangling Classifier Tuning and Joint Optimization in Semi-Supervised Security Classification

Abstract

Background. Labeled data for security classification is scarce. Semi-supervised learning (SSL) addresses this by propagating labels from a small labeled pool to a larger unlabeled pool. Yet security applications often treat SSL as a black box: default parameters, a fixed classifier, and no handling of the class imbalance that pseudo-labels induce.

Aims. Recent work reports sizeable gains from optimizing SSL pipelines via joint search, AutoML, or per-component tuning. These gains are hard to attribute: they may reflect useful SSL-classifier interactions, or they may come mostly from simply tuning the downstream classifier. We disentangle these effects for binary tabular security data with classical SSL and tree-based classifiers.

Method. We build SemiScope as an analysis instrument, not a deployment recommendation. It uses Bayesian Optimization to jointly tune SSL settings, confidence filtering, oversampling, and the classifier. The key control, Tuned-Clf, fixes SSL to its defaults but receives the same 100-trial classifier budget and the same validation-set threshold tuning as SemiScope. At 10% labels, we compare the two with a paired TOST (two one-sided tests) procedure, using ±1.0 g-measure points as the smallest effect size of interest.
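The equivalence comparison can be illustrated with a minimal sketch of a paired TOST. The abstract fixes only the pairing and the ±1.0 margin; the t-test form, the 0.05 significance level, and the names `paired_tost`, `scores_a`, and `scores_b` are assumptions made for illustration, not details of SemiScope itself.

```python
import numpy as np
from scipy import stats


def paired_tost(scores_a, scores_b, margin=1.0, alpha=0.05):
    """Paired TOST on per-run scores (hypothetical helper, not the paper's code).

    Declares equivalence when the mean paired difference is shown to lie
    inside (-margin, +margin). alpha=0.05 is an assumed significance level.
    """
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    # One-sided test 1: H0 mean(diff) <= -margin vs H1 mean(diff) > -margin
    p_lower = stats.ttest_1samp(diff, -margin, alternative="greater").pvalue
    # One-sided test 2: H0 mean(diff) >= +margin vs H1 mean(diff) < +margin
    p_upper = stats.ttest_1samp(diff, margin, alternative="less").pvalue
    # Both one-sided nulls must be rejected, so report the larger p-value
    p_tost = float(max(p_lower, p_upper))
    return p_tost, p_tost < alpha
```

A failure to reject under this procedure does not show that the two pipelines differ; it only means equivalence within the margin could not be established, which is how an "inconclusive" outcome arises.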

Results. SemiScope beats every default SSL baseline on all five datasets, improving over the strongest by 0.7-12.7 points. Under the equal-budget control, Tuned-Clf is statistically equivalent to the full pipeline on 4 of 5 datasets; the Phishing result is inconclusive. Classifier hyperparameter optimization (HPO) alone recovers a median 86% of SemiScope’s gain over Default Self-Training (ST) + Random Forest (RF).

Conclusions. The reusable contribution is the decomposition protocol. A simpler recipe suffices: use Self-Training, tune the classifier with Bayesian Optimization, and tune the decision threshold on validation data. This recipe reaches within 1 g-measure point of Supervised RF at 20-30% labels on four datasets and at 40% on Drebin, at the same or lower label rate than Default ST + RF on every dataset.
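The last step of this recipe, threshold tuning on validation data, can be sketched as follows. The abstract does not define g-measure, so the code assumes the harmonic mean of recall and specificity, scaled to 0-100; the threshold grid and the names `g_measure` and `tune_threshold` are likewise illustrative assumptions.

```python
import numpy as np


def g_measure(y_true, y_pred):
    """Assumed definition: harmonic mean of recall and specificity, in 0-100."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if recall + specificity == 0:
        return 0.0
    return float(100.0 * 2 * recall * specificity / (recall + specificity))


def tune_threshold(y_val, proba_val, grid=None):
    """Pick the decision threshold that maximizes g-measure on validation data."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # assumed grid: 0.05 steps
    proba_val = np.asarray(proba_val, dtype=float)
    scores = [g_measure(y_val, proba_val >= t) for t in grid]
    best = int(np.argmax(scores))  # ties resolve to the lowest threshold
    return float(grid[best]), float(scores[best])
```

Here `proba_val` stands for the positive-class probabilities that the tuned classifier assigns to held-out validation samples; the selected threshold would then be applied unchanged to the test set.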