SwordBench: Evaluating Orthogonality of Steering Image Representations

Abstract: Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks.

Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias.
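The two notions above can be illustrated with a minimal sketch. It assumes that steering is a linear projection that removes a single concept direction from the representation, and that concept detection and the downstream task are both linear probes. The abstract does not specify the steering operator or the exact metric definitions, so the function names and formulas below are hypothetical illustrations rather than the benchmark's implementation.

```python
import numpy as np


def project_out(X, v):
    """Remove the component of each row of X along direction v (assumed steering operator)."""
    u = v / np.linalg.norm(v)
    return X - np.outer(X @ u, u)


def linear_accuracy(X, y, w, b=0.0):
    """Accuracy of the linear probe (X @ w + b > 0) against labels y in {0, 1}."""
    preds = ((X @ w + b) > 0).astype(int)
    return float(np.mean(preds == y))


def cross_concept_robustness(X, y_a, w_a, v_b):
    """Change in concept-A detection accuracy after orthogonalizing inputs against concept B.

    Hypothetical definition: a value near 0 means detection of A is stable.
    """
    before = linear_accuracy(X, y_a, w_a)
    after = linear_accuracy(project_out(X, v_b), y_a, w_a)
    return after - before


def collateral_damage(X_clean, y_task, w_task, v_bias):
    """Drop in downstream accuracy on bias-free inputs caused by steering.

    Hypothetical definition: 0 means steering left these inputs' predictions intact.
    """
    before = linear_accuracy(X_clean, y_task, w_task)
    after = linear_accuracy(project_out(X_clean, v_bias), y_task, w_task)
    return before - after
```

Under these assumptions, a concept direction orthogonal to both the other concept probes and the task probe gives zero for both quantities. Any overlap between the directions shows up as a nonzero change.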

We find that although a linear support vector machine exhibits superior separability and orthogonality, it fails to achieve zero collateral damage, often trailing sparse autoencoders. In simpler regimes, both standard baselines and optimization-based methods fail to achieve perfect steering. The source code will be made available soon on GitHub.
