Anthropic says these topics are too dangerous to let its Fable 5 model talk about
Anthropic on Tuesday publicly released Claude Fable 5, its first “Mythos-class” model, which it says surpasses its previous frontier Opus models in overall capabilities. But the model launches with safeguards designed to prevent it from answering queries on topics like cybersecurity, biology, and chemistry, where the company has publicly worried about the model’s potential to “uplift” malicious actors.
Anthropic says Fable 5 operates on the “same underlying model” as Mythos 5, which is coming out of its monthslong “Mythos Preview” period today, but only for “a small group of cyberdefenders” judged trustworthy through the existing Project Glasswing. Unlike Mythos 5, though, the publicly accessible Fable 5 is designed to funnel queries on certain sensitive topics to the earlier Claude Opus 4.8 model and to warn the user when this is happening.
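The routing behavior described above can be pictured with a small sketch. This is only an illustration of the general pattern (classify the query, fall back to an older model on sensitive topics, and tell the user). Anthropic has not published its implementation, so the keyword lists, model labels, and function names here are all hypothetical, and a real safeguard would rely on trained classifiers rather than keyword matching.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for illustration only; not real API model identifiers.
PRIMARY_MODEL = "fable-5"
FALLBACK_MODEL = "opus-4.8"

# Toy stand-in for a trained classifier: a few keywords per topic area
# named in the article. These lists are assumptions, not Anthropic's.
SENSITIVE_KEYWORDS = {
    "cybersecurity": ("exploit", "malware"),
    "biology": ("pathogen", "virus"),
    "chemistry": ("synthesis", "reagent"),
}


@dataclass
class RoutingDecision:
    model: str
    warning: Optional[str]  # None when the primary model handles the query


def classify_topic(prompt: str) -> Optional[str]:
    """Return the first sensitive topic whose keyword appears in the prompt."""
    text = prompt.lower()
    for topic, keywords in SENSITIVE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return topic
    return None


def route(prompt: str) -> RoutingDecision:
    """Send flagged prompts to the fallback model and attach a user warning."""
    topic = classify_topic(prompt)
    if topic is None:
        return RoutingDecision(PRIMARY_MODEL, None)
    warning = f"This request touches on {topic}, so {FALLBACK_MODEL} is answering it."
    return RoutingDecision(FALLBACK_MODEL, warning)
```

A keyword filter like this one would flag many harmless prompts, which is a crude version of the false-positive trade-off Anthropic describes.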
Anthropic said it has tuned these safeguards to be “stricter than ideal,” meaning the system may occasionally refuse “harmless requests” in a way that it acknowledges may be frustrating for regular users. But Anthropic says such false positives came up in fewer than five percent of all sessions in testing and are worth it to avoid situations where Mythos could give malicious actors assistance in “causing serious harm that they couldn’t have received from other sources.”
I can’t let you do that, Dave
Fable 5’s topic-based safeguards are built around a system of classifiers designed to broadly detect banned prompt subjects as well as any potential jailbreak attempts. In over 1,000 hours of red-team testing through a bug bounty program, Anthropic says external teams failed to find any universal jailbreaks for Fable 5. The new model also resisted automated jailbreak attempts to a much greater degree than previous Claude Opus models, Anthropic said.
The company said it is particularly worried about Mythos 5’s ability to perform “agentic hacking,” executing multi-part cyberattacks with much more facility than earlier models. But testing from the UK’s AI Security Institute in recent months found that Mythos Preview performed similarly to OpenAI’s GPT-5.5 on a suite of Capture the Flag challenges, suggesting Mythos’ performance is not “a breakthrough specific to one model.”
Among the usual raft of fair-to-middling benchmark test improvements that Anthropic reports for Mythos 5 over previous frontier models, the company claims a significant jump in the model’s capabilities on the cybersecurity-focused ExploitBench test. Mythos 5 scored 78 percent on the benchmark’s tests of vulnerable code exploits, up sharply from the 40 percent scored by Opus 4.8 and above even the 69 percent achieved by Mythos Preview.
While earlier Anthropic models blocked bioweapons-related queries, that classifier now applies to all chemistry- and biology-related queries in Fable 5. The company says it worries that “well-resourced malicious actors” could use even seemingly benign queries on these subjects to assist with “highly risky biological research” in a much more effective way than with previous models.
Who can you trust?
Anthropic seems to understand that making certain topics off-limits for Fable 5 is something of a double-edged sword. The company writes that “the same queries that are beneficial in the hands of cybersecurity professionals and biology researchers could be dangerous if available to malicious actors.” That puts Anthropic in the somewhat awkward position of having to judge who is and is not trustworthy enough to have access to a model that it says has potentially dangerous capabilities.
The company says it will be periodically expanding its existing Project Glasswing program “in consultation with the US government” to let in more cybersecurity professionals. That expansion will also include a new trusted access program for life sciences organizations that removes Fable 5’s biology/chemistry safeguards while keeping cybersecurity safeguards in place.
API and Enterprise users will be able to access the Fable 5 model at a cost of $10 per million input tokens and $50 per million output tokens starting today. Those prices are 67 to 100 percent higher than those for OpenAI’s recent GPT-5.5, a difference that could be significant at a time when many users are balking at the high cost of frontier models. Anthropic’s existing subscription plans will include access to Fable 5 through June 22, after which users will need to purchase “usage credits” to access the new model. Anthropic says it eventually hopes to restore Fable 5 access as a standard part of subscription plans once it has “sufficient capacity” to do so.
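To see what those per-token prices mean for a single request, here is a minimal cost calculation. It uses only the two prices quoted above. The function name and the example token counts are illustrative assumptions, and the sketch ignores any discounts, caching, or other billing details the article does not cover.

```python
from decimal import Decimal

# Prices quoted in the article, in US dollars per million tokens.
INPUT_PRICE_PER_MILLION = Decimal("10")
OUTPUT_PRICE_PER_MILLION = Decimal("50")
TOKENS_PER_MILLION = Decimal("1000000")


def request_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the cost of one request at the quoted per-million-token prices."""
    input_cost = Decimal(input_tokens) / TOKENS_PER_MILLION * INPUT_PRICE_PER_MILLION
    output_cost = Decimal(output_tokens) / TOKENS_PER_MILLION * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost


# Hypothetical request: 20,000 input tokens and 2,000 output tokens.
cost = request_cost_usd(20_000, 2_000)
print(f"${cost:.2f}")  # prints "$0.30"
```

Output tokens cost five times as much as input tokens at these rates, so in this example the 2,000 output tokens account for a third of the bill despite being a tenth of the input volume.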