Smart Ensemble Learning Framework for Predicting Groundwater Heavy Metal Pollution

Abstract: Groundwater in the Densu Basin is increasingly threatened by heavy metal contamination, but conventional assessment methods fail to capture the statistical complexity and spatial heterogeneity of pollution indicators. A key challenge is modelling the Heavy Metal Pollution Index (HPI), which is typically skewed and driven by correlated contaminants, so models fitted to the untransformed index tend to give biased predictions.

This study develops a predictive framework that combines response transformations with ensemble machine learning evaluated by nested cross-validation. HPI was modelled on three response scales (raw, log-transformed, and Gaussian copula-transformed), and each scale was evaluated across six learners: support vector machine regression (SVM), $k$-nearest neighbours (k-NN), classification and regression trees (CART), Elastic Net, kernel ridge regression, and a stacked Lasso ensemble.

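
The combination of a transformed response, a stacked ensemble, and nested cross-validation can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: it uses scikit-learn, approximates the Gaussian copula transform with a rank-based normal-score transform (`QuantileTransformer`), tunes only one hyperparameter, and runs on synthetic data. The helper names `make_synthetic`, `build_model`, and `nested_cv_r2` are hypothetical.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet, LassoCV
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor


def make_synthetic(n, seed=0):
    """Synthetic stand-in for the data: 4 predictors, skewed positive response."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 4))
    y = np.exp(1.0 + 0.8 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n))
    return X, y


def build_model(inner_cv, n_quantiles=30):
    """Stacked ensemble fitted to a normal-score response, tuned by an inner CV."""
    base_learners = [
        ("svm", make_pipeline(StandardScaler(), SVR())),
        ("knn", make_pipeline(StandardScaler(), KNeighborsRegressor())),
        ("cart", DecisionTreeRegressor(max_depth=4, random_state=0)),
        ("enet", make_pipeline(StandardScaler(), ElasticNet(alpha=0.1))),
        ("krr", make_pipeline(StandardScaler(), KernelRidge(kernel="rbf"))),
    ]
    # Lasso combines the out-of-fold predictions of the base learners.
    stack = StackingRegressor(
        estimators=base_learners, final_estimator=LassoCV(cv=3), cv=inner_cv
    )
    # Assumption: a rank-based map to normal scores stands in for the copula step.
    to_normal = QuantileTransformer(
        n_quantiles=n_quantiles, output_distribution="normal", random_state=0
    )
    model = TransformedTargetRegressor(
        regressor=stack, transformer=to_normal, check_inverse=False
    )
    # Illustrative grid only: the SVM penalty C is the single tuned parameter.
    grid = {"regressor__svm__svr__C": [1.0, 10.0]}
    return GridSearchCV(model, grid, cv=inner_cv, scoring="r2")


def nested_cv_r2(X, y, n_outer=5, n_inner=3, seed=0):
    """Outer folds estimate performance; inner folds handle tuning and stacking."""
    inner_cv = KFold(n_splits=n_inner, shuffle=True, random_state=seed)
    outer_cv = KFold(n_splits=n_outer, shuffle=True, random_state=seed + 1)
    return cross_val_score(build_model(inner_cv), X, y, cv=outer_cv, scoring="r2")


if __name__ == "__main__":
    X, y = make_synthetic(120)
    scores = nested_cv_r2(X, y)
    print("outer-fold R^2:", np.round(scores, 3))
```

Because `TransformedTargetRegressor` inverts the transform at prediction time, the scores in this sketch are computed on the original response scale; the scale used for the metrics reported in the study may differ. Swapping `to_normal` for a log transform (for example `FunctionTransformer(np.log, np.exp)`) would give the log-scale variant.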

Raw-scale models produced deceptively high fits (Elastic Net and the stacked ensemble both reached $R^2 \approx 1.0$), which points to over-optimistic estimates. The log transformation stabilised variance (SVM: $R^2 = 0.93$, RMSE $= 0.18$; k-NN: $R^2 = 0.92$, RMSE $= 0.20$). The Gaussian copula transformation gave the most reliable results: the stacked ensemble achieved $R^2 = 0.96$ (RMSE $= 0.19$), and the other learners also maintained high accuracy.

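
To make the reported quantities concrete, the sketch below shows one common way to map a skewed response to normal scores, together with the usual definitions of RMSE and $R^2$. It is an assumption-laden illustration using only the Python standard library: the rank-based formula $z_i = \Phi^{-1}(r_i / (n + 1))$ is a standard normal-score construction and may not match the study's copula implementation, ties are not handled, and the HPI values are invented for the example. Note that RMSE is expressed in the units of whichever scale the response is on, so RMSE values from different scales are not directly comparable.

```python
import math
from statistics import NormalDist


def normal_scores(values):
    """Rank-based normal scores: z_i = Phi^{-1}(rank_i / (n + 1)). No tie handling."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = std_normal.inv_cdf(rank / (n + 1))
    return scores


def rmse(observed, predicted):
    """Root mean squared error, in the units of the response scale."""
    sq_err = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sq_err / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Invented, strongly skewed values used purely for illustration.
    hpi = [12.0, 15.5, 9.8, 240.0, 31.2, 18.4, 22.9]
    z = normal_scores(hpi)
    print([round(v, 3) for v in z])
```

The transform depends only on ranks, so the single extreme value (240.0) no longer dominates the spread of the response, which is the property that makes this kind of scale attractive for skewed indices.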

Copula-based models yielded better-behaved residuals and spatially plausible maps. DBSCAN clustering identified Fe and Mn as the primary contributors to HPI, consistent with regional hydrogeochemistry. Limitations include reliance on random rather than spatial cross-validation and a scope restricted to a single basin. Future work should explore spatial validation and other geological settings. Overall, distribution-aware ensembles with clustering diagnostics offer robust, interpretable assessments of groundwater contamination.

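
The spatial validation suggested as future work could take several forms. One simple option, sketched here as an assumption rather than a method from the study, is blocked cross-validation: samples are assigned to square grid blocks by their coordinates, and whole blocks are held out together so that nearby wells never appear on both sides of a split. The helper names `grid_block_ids` and `spatial_folds`, the block size, and the coordinates are all hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold


def grid_block_ids(easting, northing, block_size):
    """Assign each sample to a square grid block with side length `block_size`."""
    col = np.floor((easting - easting.min()) / block_size).astype(int)
    row = np.floor((northing - northing.min()) / block_size).astype(int)
    return row * (col.max() + 1) + col


def spatial_folds(easting, northing, block_size, n_splits=5):
    """Return (train_idx, test_idx) pairs in which no block is split across sets."""
    groups = grid_block_ids(easting, northing, block_size)
    placeholder = np.zeros((len(groups), 1))  # GroupKFold only needs the sample count
    splitter = GroupKFold(n_splits=n_splits)
    return groups, list(splitter.split(placeholder, groups=groups))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Invented coordinates (arbitrary units) standing in for well locations.
    easting = rng.uniform(0.0, 100.0, size=200)
    northing = rng.uniform(0.0, 100.0, size=200)
    groups, folds = spatial_folds(easting, northing, block_size=25.0)
    for k, (train_idx, test_idx) in enumerate(folds):
        print(f"fold {k}: {len(train_idx)} train, {len(test_idx)} test")
```

The resulting folds can be passed as the `cv` argument of scikit-learn's cross-validation utilities. The block size is a modelling choice: blocks smaller than the range of spatial autocorrelation would still leak information between training and test sets.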