DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models
Abstract: General-purpose safety benchmarks for large language models do not adequately evaluate disability-related harms. We introduce DisaBench: a taxonomy of twelve disability harm categories co-created with people with disabilities and red teaming experts, a taxonomy-driven evaluation methodology that pairs benign and adversarial prompts across seven life domains, and a dataset of 175 prompts with human-annotated labels on 525 prompt-response pairs.
Annotation by four evaluators with lived disability experience reveals three findings: harm rates vary sharply by disability type and are likely to compound in non-text modalities; terminology-driven harm is culturally and temporally bound rather than universally assessable; and standard safety evaluation catches overt failures while missing the subtle harms that only domain expertise can recognize.
Disability harm is simultaneously personal, intersectional, and community-defined: it cannot be isolated from the full context of who a person is, and general-purpose benchmarks systematically miss it. We will release the dataset, taxonomy, and methodology via Hugging Face, along with an open-source red teaming framework that integrates directly into existing safety pipelines and requires no additional infrastructure.