MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

Abstract: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (Omni LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark of 1196 scenarios across four safety categories, where judging each scenario accurately requires integrating multiple modalities.

Each unsafe scenario is paired with a minimally different safe counterpart to assess model sensitivity. Our evaluations of state-of-the-art models show that Omni LLMs struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present.
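
To make the paired design concrete, the following is a minimal sketch of how verdicts on unsafe/safe pairs could be scored. It is not MCBench's actual scoring code: the abstract does not specify the metrics, so the `PairResult` schema, the `paired_scores` function, and the three metric names are assumptions chosen for illustration, and the demo data is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Model verdicts for one unsafe scenario and its safe counterpart.

    Hypothetical schema for illustration; not taken from the paper.
    """
    pair_id: str
    unsafe_flagged: bool  # True if the model judged the unsafe variant as unsafe
    safe_flagged: bool    # True if the model judged the safe variant as unsafe


def paired_scores(results: list[PairResult]) -> dict[str, float]:
    """Summarise verdicts over scenario pairs (assumed metrics)."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    return {
        # Fraction of unsafe variants correctly flagged.
        "unsafe_recall": sum(r.unsafe_flagged for r in results) / n,
        # Fraction of safe variants correctly left unflagged.
        "safe_accuracy": sum(not r.safe_flagged for r in results) / n,
        # Fraction of pairs where BOTH variants were judged correctly.
        "pair_accuracy": sum(
            r.unsafe_flagged and not r.safe_flagged for r in results
        ) / n,
    }


# Made-up verdicts for four pairs.
demo = [
    PairResult("p1", unsafe_flagged=True, safe_flagged=False),   # both correct
    PairResult("p2", unsafe_flagged=True, safe_flagged=True),    # over-flags the safe variant
    PairResult("p3", unsafe_flagged=False, safe_flagged=False),  # misses the unsafe variant
    PairResult("p4", unsafe_flagged=True, safe_flagged=False),   # both correct
]
scores = paired_scores(demo)
print(scores)
```

Scoring at the pair level separates two failure modes that a single accuracy number would blur: a model that flags everything scores well on unsafe variants but fails every safe counterpart, so its pair-level score drops to zero.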

Analysis of reasoning traces shows that, although models can extract modality-specific information, they often fail to integrate these cues effectively for safety judgments. Our findings reveal that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings, underscoring the need for improved architectures and training strategies for multimodal safety.

Paper Details:

Authors: Manh Luong, Tamas Abraham, Junae Kim, Amar Kaur, Rollin Omari, Gholamreza Haffari, Trang Vu, Lizhen Qu, Dinh Phung
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Date: 17 Apr 2026
