A Theoretical Game of Attacks via Compositional Skills

Abstract: As large language models grow increasingly capable, concerns about their safe deployment have intensified. While numerous alignment strategies aim to restrict harmful behavior, these defenses can still be circumvented through carefully designed adversarial prompts.

In this work, we introduce a theoretical framework that formalizes a game between an attacker and a defender. Within this framework, we design a theoretical best-response attack strategy and show that it is closely related to many existing adversarial prompting methods.

We further analyze the resulting game, characterize its equilibria, and reveal inherent advantages for the attacker. Drawing on our theoretical analysis, we also derive a provably optimal defense strategy.

Empirically, we evaluate a practical instantiation of the theoretically optimal attack and observe stronger performance than existing adversarial prompting approaches across diverse settings spanning different LLMs and benchmarks.

Paper Details:

Authors: Xinbo Wu, Huan Zhang, Abhishek Umrawal, Lav R. Varshney
arXiv ID: 2605.01034
Subject: Computation and Language (cs.CL)
Submission Date: 1 May 2026
