More details on Fable 5’s cyber safeguards and our jailbreak framework

Claude Fable 5 has been redeployed and is now available globally for all users. We’re taking this opportunity to share further information in two areas.

First, we provide more information on the cybersecurity safeguards (specifically, the safety classifiers) that we launched with the model. These are the AI systems that accompany the model and detect and block dangerous (or potentially dangerous) cybersecurity uses. Here, we provide a detailed list of the types of harms Fable 5’s classifiers are, and are not, designed to prevent.

Second, we lay out an early draft of our proposed AI jailbreak severity framework, on which we’ve been working with our Glasswing partners. AI jailbreaks are unusual ways of prompting an AI model to bypass its safeguards, thus unblocking the behaviors (like dangerous or potentially dangerous cybersecurity tasks) we seek to prevent.

Jailbreaks vary in severity: sometimes they only unblock minor undesirable behaviors, and sometimes they unblock a wide range of harmful outputs, making a model much more dangerous. Yet there is no agreed-upon framework for describing a given jailbreak’s severity. Such a framework would allow AI developers to speak to governments (and vice versa) in consistent terms about the risks posed by each jailbreak.

What we’re sharing today reflects our current thinking. Our hope is to spark a helpful discussion across academia, industry, civil society, and government about how and where these lines should be drawn. We welcome feedback and critique on this framework at cyber-safeguards@anthropic.com. We’ve also launched a HackerOne program where security researchers can submit potential cyber jailbreaks they discover in Fable 5 for our review.

We believe that by working together, we can establish a standard that enables the defensive uses of this technology while preventing its misuse.

Fable 5’s cyber safeguards

Areas such as cybersecurity are particularly challenging for AI safeguards because they are often dual use. That is, many cybersecurity capabilities can be used for benign or harmful purposes. For example, we want to allow cyber defenders to use our models to scan their codebases to find software vulnerabilities, but this same capability could, in the wrong hands, be the precursor to a cyberattack.

For that reason, we do not intend to block all cybersecurity-related activities for Fable 5. Instead, we train our safety classifiers to discern between four categories of cybersecurity use, ranging from the most clearly dangerous to the most clearly benign. These are summarized in the table below:

| Category | Description | Intended classifier behavior |
| --- | --- | --- |
| Prohibited use | Activities that could be used to cause significant harm and/or harm in a significant majority of uses, with little-to-no defensive utility | Block |
| High-risk dual use | Activities that are used widely by malicious actors, but also have beneficial applications | Block |
| Low-risk dual use | Activities that are mostly used for defensive benefit that can also provide value to malicious actors | Monitor; sometimes block as part of the safety margin to prevent meaningful jailbreaks |
| Benign use | Activities that do not cause harm | Allow, with some monitoring |
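
As a reading aid, the table above can be expressed as a small lookup from category to intended behavior. This is only an illustrative sketch in Python, not a description of how the classifiers are implemented: the names `Category`, `Action`, `intended_action`, and the `in_safety_margin` flag are hypothetical and were chosen for this example.

```python
from enum import Enum


class Category(Enum):
    """The four categories of cybersecurity use from the table."""
    PROHIBITED = "Prohibited use"
    HIGH_RISK_DUAL_USE = "High-risk dual use"
    LOW_RISK_DUAL_USE = "Low-risk dual use"
    BENIGN = "Benign use"


class Action(Enum):
    """Intended classifier behaviors (hypothetical names)."""
    BLOCK = "block"
    MONITOR = "monitor"
    ALLOW = "allow"  # benign use is still subject to some monitoring


def intended_action(category: Category, in_safety_margin: bool = False) -> Action:
    """Return the intended behavior for a category, per the table.

    `in_safety_margin` is an assumed flag standing in for the cases where
    a low-risk dual use request is blocked as part of the safety margin.
    """
    if category in (Category.PROHIBITED, Category.HIGH_RISK_DUAL_USE):
        return Action.BLOCK
    if category is Category.LOW_RISK_DUAL_USE:
        return Action.BLOCK if in_safety_margin else Action.MONITOR
    return Action.ALLOW


if __name__ == "__main__":
    for category in Category:
        print(f"{category.value}: {intended_action(category).value}")
```

The only category whose outcome depends on more than the category itself is low-risk dual use, which is why it takes the extra flag.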

Note that the low-risk dual use category overlaps considerably with what falls into the “safety margin” we described in our post on redeploying Fable. The safety margin includes many benign uses which we would prefer to allow, but which we block out of an abundance of caution. The safety margin means that a request has to look very clearly safe to avoid triggering the classifier. We can adjust the size of the safety margin to have greater confidence that the classifiers will catch harmful behaviors (for Fable 5, we made this margin larger than for previous models).
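
One way to picture an adjustable safety margin is as a shift in a blocking threshold. The sketch below rests entirely on assumptions: this post does not say that the classifiers produce a numeric score or use a threshold, and `risk_score`, `base_threshold`, `safety_margin`, and `is_blocked` are hypothetical names used only for illustration.

```python
def is_blocked(
    risk_score: float,
    base_threshold: float = 0.5,
    safety_margin: float = 0.0,
) -> bool:
    """Decide whether a request is blocked under an assumed scoring model.

    Assumption: a classifier emits `risk_score` in [0, 1], where higher
    means more likely harmful. A larger `safety_margin` lowers the score
    at which blocking starts, so a request must look more clearly safe
    to pass.
    """
    effective_threshold = max(0.0, base_threshold - safety_margin)
    return risk_score >= effective_threshold


if __name__ == "__main__":
    borderline = 0.4  # hypothetical score for an ambiguous request
    print(is_blocked(borderline))                     # no margin
    print(is_blocked(borderline, safety_margin=0.2))  # larger margin
```

Under these assumptions, widening the margin trades some blocked benign requests for greater confidence that harmful ones are caught, which is the trade-off the paragraph above describes.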

Classifiers are one piece in a broader set of safeguards. In addition to classifiers, we use access controls, model safety training, and offline monitoring as further safety layers.

Below, we provide detailed, specific examples of the kinds of uses that are included in each of the four classifier categories (as well as some uses that overlap with cybersecurity but which are out of the scope of these specific classifiers). These examples describe the current intended behavior of our classifiers, but note that the classifiers might change over time in response to feedback or lessons we learn from their behavior in the real world.

Prohibited use

All security capabilities are dual use: under certain circumstances, they can be helpful to both attackers and defenders. The prohibited use actions listed here either have relatively little direct defensive benefit, are overtly criminal, or contribute to a very high degree of harm. What ties them together is the asymmetry in what they offer to attackers (far more) versus what they offer to defenders (much less). Since the risk associated with these capabilities is high, Fable 5’s classifiers are intended to block all of these requests.

Prohibited use actions include:

  • Destructive impact: ransomware/encryption-for-extortion, wipers, defacement, data or process integrity sabotage, and denial of service;
  • Cyber-physical sabotage: manipulating physical processes (power, water, oil/gas, transportation, medical devices) via digital means;