Your Outlier Detection is Lying to You

Why DBSCAN breaks in high dimensions and what to do instead.

You tuned epsilon to 1.5 because it felt reasonable. Here is what that decision actually means. On a dataset with 16 features, shifting epsilon from 1.0 to 2.0 changes your outlier rate from 60.31% to 2.35%. Same data. Same algorithm. One unit of difference in a single parameter. These are not numbers from a toy dataset: they come from a decade of real Australian weather records, 145,460 observations of 16 continuous meteorological variables. If someone asked you to justify eps=1.5 in a production review, what would you say?

The Setup

The dataset is the Australian weather observations from the Bureau of Meteorology, publicly available on Kaggle. It contains daily measurements from 49 stations across the country: temperature, rainfall, wind speed, pressure, humidity. Real data, messy data, with missing values and a distribution that does not care about your assumptions. The preprocessing is standard: select the numerical columns, impute missing values with the column median, and scale everything with StandardScaler. Sixteen features survive the selection.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

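df = pd.read_csv("weatherAUS.csv")
# Keep only the numeric columns (16 in this dataset)
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
# Fill missing values with each column's median
imputer = SimpleImputer(strategy='median')
df_num_imputed = pd.DataFrame(
    imputer.fit_transform(df[num_cols]), columns=num_cols
)
# Standardize to zero mean and unit variance so no feature dominates distances
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_num_imputed)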

print(f"Total rows: {len(df_scaled)} | Dimensions: {len(num_cols)}")
# Total rows: 145460 | Dimensions: 16

Nothing unusual so far. This is the pipeline you have probably written a dozen times. The problem starts at the next step.

Why DBSCAN Cannot Handle This

DBSCAN labels a point as noise when it is neither a core point (a point with at least min_samples points, itself included, within a radius of epsilon) nor within epsilon of a core point. The logic is intuitive in two or three dimensions. In sixteen dimensions it stops making geometric sense.

The reason is the curse of dimensionality. As the number of dimensions grows, Euclidean distances between points concentrate: under fairly general conditions, the ratio between the distance from a point to its farthest neighbor and the distance to its nearest neighbor approaches one. In practice this means that in a high-dimensional space all points start to look roughly equidistant from each other. The notion of a dense neighborhood that DBSCAN relies on becomes increasingly difficult to define, and the choice of epsilon loses its geometric interpretation.
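The concentration effect is easy to reproduce on synthetic data. The following is a minimal sketch, assuming points drawn uniformly from the unit hypercube (not the weather data) and using only the standard library. It measures the relative contrast (d_max - d_min) / d_min of the distances from one query point as the number of dimensions grows. The function name relative_contrast, the sample size, and the dimensions tried are illustrative choices.

```python
import math
import random


def relative_contrast(n_points: int, n_dims: int, rng: random.Random) -> float:
    """Return (d_max - d_min) / d_min for the distances from one random
    query point to n_points uniform random points in the unit hypercube."""
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min


rng = random.Random(0)  # fixed seed so the run is repeatable
contrasts = {d: relative_contrast(500, d, rng) for d in (2, 16, 256)}
for n_dims, contrast in contrasts.items():
    print(f"dims={n_dims:>3}  relative contrast={contrast:.2f}")
```

The contrast shrinks as dimensions are added, which is why a fixed radius separates "near" from "far" less and less well.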

from sklearn.cluster import DBSCAN
eps_values = [1.0, 1.5, 2.0]
outlier_counts = []

for eps in eps_values:
    dbscan = DBSCAN(eps=eps, min_samples=4, n_jobs=-1)
    labels = dbscan.fit_predict(df_scaled)
    n_outliers = np.sum(labels == -1)
    pct = (n_outliers / len(df_scaled)) * 100
    outlier_counts.append(pct)
    print(f"DBSCAN eps={eps}: {n_outliers} outliers ({pct:.2f}%)")

# Output:
# DBSCAN eps=1.0: 87720 outliers (60.31%)
# DBSCAN eps=1.5: 18166 outliers (12.49%)
# DBSCAN eps=2.0: 3423 outliers (2.35%)

That is the structural problem. You are not making a calibration decision. You are making an arbitrary choice that determines whether your pipeline discards roughly 87,700 rows or 3,400 rows, and you have no principled way to defend either number.
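To see why the noise rate swings so sharply, it helps to apply DBSCAN's noise rule by hand on a small synthetic sample. The sketch below assumes 300 standard-normal points in 16 dimensions as a stand-in for the scaled weather data, so the epsilon band where the flip happens differs from the 1.0 to 2.0 range above. The helper noise_fraction is a hypothetical name, and it only reproduces the noise labelling, not the clustering itself.

```python
import math
import random


def noise_fraction(dist, eps, min_samples):
    """Fraction of points that DBSCAN's rule would label as noise.

    A point is a core point if at least `min_samples` points (itself
    included) lie within `eps`; it is noise if it is neither a core
    point nor within `eps` of one.
    """
    n = len(dist)
    neighbours = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    is_core = [len(nb) >= min_samples for nb in neighbours]
    n_noise = sum(
        1
        for i in range(n)
        if not is_core[i] and not any(is_core[j] for j in neighbours[i])
    )
    return n_noise / n


# Synthetic stand-in for the scaled data: 300 points, 16 features.
rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(300)]
# Full pairwise distance matrix, computed once and reused for every eps.
dist = [[math.dist(p, q) for q in points] for p in points]

eps_grid = [2.0, 3.0, 4.0, 5.0, 6.0]
fractions = [noise_fraction(dist, eps, min_samples=4) for eps in eps_grid]
for eps, frac in zip(eps_grid, fractions):
    print(f"eps={eps:.1f}  noise fraction={frac:.2%}")
```

Because pairwise distances sit in a narrow band, the noise fraction goes from almost everything to almost nothing within a small range of epsilon values.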

The Paradigm Shift: Isolation Over Distance

Isolation Forest does not use distances. It builds an ensemble of random trees; at each node of a tree it picks a feature at random and a split value at random between that feature's minimum and maximum in the current subset. A point is considered anomalous if it gets isolated near the root of the trees, meaning very few splits were needed to separate it from the rest of the data.

This matters because anomalies are by definition rare and different. A truly anomalous point sits in a sparse region of the feature space and is easy to isolate with just a few random cuts. A normal point lives in a dense cluster and requires many cuts to separate. The algorithm exploits this structural property without ever computing a distance.

The practical consequence is that Isolation Forest sidesteps the concentration of distances that kills DBSCAN in high dimensions. Each split operates on a single feature, so the geometric complexity does not scale with the number of dimensions in the same catastrophic way.
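The isolation idea can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's implementation: it follows a single point through random splits instead of building full trees, and it skips the subsampling and the path-length normalization that the real algorithm applies. The data is synthetic, and the function name isolation_depth and the max_depth cap are assumptions made for the example.

```python
import random


def isolation_depth(target, data, rng, max_depth=64):
    """Count the random axis-aligned splits needed to isolate `target`
    from the other rows of `data` (`target` must be one of the rows)."""
    subset = data
    depth = 0
    while len(subset) > 1 and depth < max_depth:
        feature = rng.randrange(len(target))
        values = [row[feature] for row in subset]
        lo, hi = min(values), max(values)
        if lo == hi:
            # Degenerate feature: nothing to split on. It still counts
            # toward the depth cap so the loop always terminates.
            depth += 1
            continue
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target.
        if target[feature] < split:
            subset = [row for row in subset if row[feature] < split]
        else:
            subset = [row for row in subset if row[feature] >= split]
        depth += 1
    return depth


rng = random.Random(0)
n_dims = 16
data = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(254)]
inlier = [0.0] * n_dims   # sits in the middle of the dense cloud
outlier = [6.0] * n_dims  # far from every other point on every feature
data += [inlier, outlier]

n_trees = 200
mean_inlier = sum(isolation_depth(inlier, data, rng) for _ in range(n_trees)) / n_trees
mean_outlier = sum(isolation_depth(outlier, data, rng) for _ in range(n_trees)) / n_trees
print(f"mean splits to isolate inlier:  {mean_inlier:.1f}")
print(f"mean splits to isolate outlier: {mean_outlier:.1f}")
```

The outlier needs far fewer splits on average, and that gap is what the anomaly score measures.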

from sklearn.ensemble import IsolationForest

# For meteorological data, ~5% of anomalous events is a reasonable estimate
# based on domain knowledge. This is not a magic number: it is a claim
# you can argue in front of a domain expert.
CONTAMINATION = 0.05
iso = IsolationForest(contamination=CONTAMINATION, random_state=42, n_jobs=-1)
iso.fit(df_scaled)
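# decision_function: the lower the score, the more anomalous; negative values are flagged
anomaly_scores = iso.decision_function(df_scaled)
# predict: -1 for anomalies, 1 for normal points
predictions = iso.predict(df_scaled)
df['Anomaly_Score'] = anomaly_scores
df['Is_Anomaly'] = (predictions == -1)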

Notice what changed conceptually. With DBSCAN you were choosing a geometric radius with no interpretable meaning in 16 dimensions. With Isolation Forest you are choosing a contamination rate, a domain assumption you can state explicitly. You can argue that you expect approximately 5 percent of these observations to be genuine meteorological anomalies. That is a claim you can bring to a domain expert or a code reviewer. An epsilon of 1.5 is not.
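It helps to see what the contamination parameter does mechanically. In scikit-learn, a numeric contamination sets the fitted attribute offset_ to the corresponding percentile of the training scores, and decision_function returns score_samples minus that offset. The sketch below checks this on synthetic Gaussian data, which is an assumption standing in for the weather features.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the scaled data: 2000 rows, 16 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))

iso = IsolationForest(contamination=0.05, random_state=0).fit(X)

raw_scores = iso.score_samples(X)     # raw anomaly scores (lower = more anomalous)
decision = iso.decision_function(X)   # raw scores shifted by offset_
flagged_fraction = float(np.mean(iso.predict(X) == -1))

print(f"offset_: {iso.offset_:.4f}")
print(f"5th percentile of raw scores: {np.percentile(raw_scores, 5):.4f}")
print(f"flagged fraction: {flagged_fraction:.4f}")
```

On the training data, the flagged fraction lands at about the contamination rate, because zero on the decision_function scale is defined as that percentile.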

The Sensitivity Problem Has Not Disappeared

Here is something that deserves honesty. Isolation Forest does not eliminate parameter sensitivity. It relocates it to a space where the sensitivity is at least interpretable.

print("--- Threshold sensitivity in Isolation Forest ---")
for threshold in [-0.10, -0.05, 0.00, 0.05]:
    n = np.sum(anomaly_scores < threshold)
    print(f" Threshold {threshold:+.2f}: {n} outliers ({(n/len(df))*100:.2f}%)")

# Output:
# Threshold -0.10: 123 outliers (0.08%)
# Threshold -0.05: 1405 outliers (0.97%)
# Threshold +0.00: 7273 outliers (5.00%)
# Threshold +0.05: 28844 outliers (19.83%)

The range from 123 to 28,844 outliers is still dramatic. The difference from the DBSCAN case is that each of these thresholds maps to a falsifiable claim about the data. A threshold of 0.00, for example, is exactly the claim that 5% of the observations are anomalous.
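The mapping between a threshold and the claim it encodes can be made explicit in both directions. The following is a small sketch using NumPy on synthetic scores (an assumption: the normal distribution here merely stands in for anomaly_scores). The helper names threshold_for_rate and rate_for_threshold are hypothetical.

```python
import numpy as np


def threshold_for_rate(scores: np.ndarray, rate: float) -> float:
    """Score cut-off below which roughly `rate` of the observations fall."""
    return float(np.quantile(scores, rate))


def rate_for_threshold(scores: np.ndarray, threshold: float) -> float:
    """Fraction of observations flagged by `scores < threshold`."""
    return float(np.mean(scores < threshold))


# Synthetic stand-in for the Isolation Forest decision scores.
rng = np.random.default_rng(0)
scores = rng.normal(loc=0.08, scale=0.05, size=10_000)

for rate in (0.01, 0.05, 0.20):
    cut = threshold_for_rate(scores, rate)
    flagged = rate_for_threshold(scores, cut)
    print(f"target rate {rate:.0%} -> threshold {cut:+.3f} -> flagged {flagged:.2%}")
```

Stating the target rate first and deriving the threshold from it keeps the discussion on the claim ("we expect about 1% of days to be anomalous") instead of on an opaque score value.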