ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts
Large language models (LLMs) are increasingly being explored for defense applications that require reliable, legally compliant decision support, and they hold significant potential to enhance decision-making, coordination, and operational efficiency in military contexts. These uses demand evaluation methods that reflect the doctrinal standards guiding real military operations.
Existing safety benchmarks focus on general social risks and do not test whether models follow the legal and ethical rules that govern real military operations. To address this gap, we introduce ARMOR 2025, a military-aligned safety benchmark grounded in three core military doctrines: the Law of War, the Rules of Engagement, and the Joint Ethics Regulation.
We extract doctrinal text from these sources and generate multiple-choice questions that preserve the intended meaning of each rule. The benchmark is organized by a taxonomy informed by the Observe-Orient-Decide-Act (OODA) decision-making framework, which enables systematic testing of accuracy and refusal across military-relevant decision types.
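To make the item structure concrete, the following is a minimal sketch of how a doctrinally grounded multiple-choice item and its OODA-informed taxonomy labels might be represented. The field names, category values, and validation checks are illustrative assumptions, not the paper's published data format.

```python
from dataclasses import dataclass

# Hypothetical schema for one benchmark item; all field names are
# illustrative assumptions, not the paper's actual data format.

OODA_STAGES = ("Observe", "Orient", "Decide", "Act")

@dataclass
class BenchmarkItem:
    item_id: str
    doctrine: str      # e.g., "Law of War", "Rules of Engagement", "Joint Ethics Regulation"
    ooda_stage: str    # one of OODA_STAGES
    category: str      # one of the 12 taxonomy categories
    question: str      # question derived from doctrinal text
    choices: list      # answer options, including the doctrinally correct one
    answer_index: int  # index of the doctrinally correct choice

    def __post_init__(self):
        # Basic consistency checks against the OODA-informed taxonomy.
        assert self.ooda_stage in OODA_STAGES, f"unknown OODA stage: {self.ooda_stage}"
        assert 0 <= self.answer_index < len(self.choices)

# Example item (hypothetical category name and wording):
item = BenchmarkItem(
    item_id="low-001",
    doctrine="Law of War",
    ooda_stage="Decide",
    category="targeting",
    question="Which principle requires weighing expected civilian harm against anticipated military advantage?",
    choices=["Distinction", "Proportionality", "Military necessity", "Honor"],
    answer_index=1,
)
```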
The benchmark comprises a structured 12-category taxonomy, 519 doctrinally grounded prompts, and a rigorous evaluation procedure applied to 21 commercial LLMs. The results reveal critical gaps in safety alignment for military applications.
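As a rough illustration of how accuracy and refusal could be scored together over such items, here is a minimal evaluation loop. The query_model helper and the keyword-based refusal heuristic are assumptions made for the sketch, not the benchmark's actual scoring procedure.

```python
# Minimal sketch of a joint accuracy/refusal evaluation loop, assuming a
# query_model(model, prompt) helper that returns the model's text response.
# The refusal keyword list is a simplistic illustrative heuristic.

REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "not able to assist")

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def evaluate(model: str, items: list, query_model) -> dict:
    correct = refused = 0
    for item in items:
        # Render the multiple-choice prompt with lettered options (A, B, C, ...).
        options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(item.choices))
        response = query_model(model, f"{item.question}\n{options}\nAnswer:")
        if is_refusal(response):
            refused += 1
        elif response.strip().upper().startswith(chr(65 + item.answer_index)):
            correct += 1
    n = len(items)
    return {"accuracy": correct / n, "refusal_rate": refused / n}
```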