Project Glasswing: what Mythos showed us

For the last few months, we’ve been testing a range of security-focused LLMs on our own infrastructure. These LLMs help us identify potential vulnerabilities in our systems so we can fix them, and they also show us what attackers will be able to do with the latest models.

None of these LLMs has captured more attention than Mythos Preview, from Anthropic. A few weeks ago, we were invited to use Mythos Preview as part of Project Glasswing. We soon pointed it at more than fifty of our own repositories to see what it would find and how it works. This post shares what we observed, what the models did well and what they didn’t, and how the architecture and process around them need to change so they can be used at scale.

What changed with Mythos Preview

Mythos Preview is a real step forward, and it’s worth saying that plainly before getting into anything else. We’ve been running models against our code for a while now, and the jump from what was possible with previous general-purpose frontier models to what Mythos Preview does today is more than a refinement of what came before.

It’s a different kind of tool doing a different kind of work, and that makes a clean apples-to-apples comparison to earlier models difficult. So rather than trying to benchmark Mythos Preview against general-purpose frontier models, it’s more useful to describe what it can actually do. Two features stood out across the work we did with it:

  • Exploit chain construction - A real attack rarely uses one bug. It chains several small attack primitives together into a working exploit. For instance, an attacker might turn a use-after-free bug into an arbitrary read and write primitive, hijack the control flow, and use return-oriented programming (ROP) chains to take full control of a system. Mythos Preview can take several of these primitives and reason about how to combine them into a working proof. The reasoning it shows along the way looks like the work of a senior researcher rather than the output of an automated scanner.

  • Proof generation - Finding a bug and proving it’s exploitable are two different things, and Mythos Preview can do both. It writes code that would trigger the suspected bug, compiles that code in a scratch environment, and runs it. If the program does what the model expected, that’s the proof. If it doesn’t, the model reads the failure, adjusts its hypothesis, and tries again. The loop matters as much as the bugs it finds, because a suspected flaw without a working proof is speculation, and Mythos Preview closes that gap on its own.

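The proof-generation loop described above can be sketched in a few lines. The Python below is a minimal illustration under stated assumptions, not Mythos Preview’s actual harness: the target is a toy function with a missing bounds check, the "model" is a stand-in `revise` callback that reads the previous output and adjusts a single parameter, and every name (`run_in_scratch`, `prove`, `read_field`, `make_reviser`) is hypothetical. Running a script in a temporary directory stands in for the compile-and-run step.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class Attempt:
    source: str     # the candidate proof program
    exit_code: int  # exit status of the run
    output: str     # stdout + stderr, fed back to the reviser on failure


def run_in_scratch(source: str, timeout: float = 10.0) -> Attempt:
    """Write the candidate into a throwaway directory and execute it."""
    with tempfile.TemporaryDirectory() as scratch:
        path = Path(scratch) / "candidate.py"
        path.write_text(source)
        proc = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True, text=True, timeout=timeout, cwd=scratch,
        )
    return Attempt(source, proc.returncode, proc.stdout + proc.stderr)


def prove(first: str,
          confirmed: Callable[[Attempt], bool],
          revise: Callable[[Attempt], Optional[str]],
          max_rounds: int = 5) -> Optional[Attempt]:
    """Run candidates until one confirms the hypothesis or the budget runs out."""
    candidate: Optional[str] = first
    for _ in range(max_rounds):
        if candidate is None:      # the reviser gave up
            return None
        attempt = run_in_scratch(candidate)
        if confirmed(attempt):     # the program did what was predicted
            return attempt
        candidate = revise(attempt)
    return None


# Toy target (hypothetical): trusts n without checking it against len(buf).
TARGET = '''
def read_field(buf, n):
    return [buf[i] for i in range(n)]
'''


def candidate_for(n: int) -> str:
    return TARGET + f"\nread_field([1, 2, 3], {n})\nprint('no fault')\n"


def make_reviser(start: int, limit: int) -> Callable[[Attempt], Optional[str]]:
    n = start

    def revise(attempt: Attempt) -> Optional[str]:
        nonlocal n
        # Read the failure: "no fault" means the length was still in bounds,
        # so the next hypothesis asks for one more element.
        if "no fault" not in attempt.output or n >= limit:
            return None
        n += 1
        return candidate_for(n)

    return revise


def confirmed(attempt: Attempt) -> bool:
    return attempt.exit_code != 0 and "IndexError" in attempt.output


result = prove(candidate_for(2), confirmed, make_reviser(2, 8))
```

Here the first two candidates run cleanly, the reviser asks for one more element each time, and the third candidate raises an `IndexError`, which the `confirmed` check treats as the proof. A real harness would also need sandboxing and resource limits that this sketch leaves out.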

Some of what we describe above is not entirely unique to Mythos Preview. When we ran other frontier models through the same harness, they found a fair number of the same underlying bugs, and in some cases they got further than we expected on the reasoning side too. Where they fell short was at the point of stitching the pieces together. A model would identify an interesting bug, write a thoughtful description of why it mattered, and then stop, leaving the actual chain unfinished and the question of exploitability open. What changed with Mythos Preview is that a model can now take those low-severity bugs (which would traditionally sit unnoticed in a backlog) and chain them into a single, more severe exploit.

Model refusals in legitimate vulnerability research

The Mythos Preview model that Anthropic provided as part of Project Glasswing did not have the additional safeguards present in generally available models (like Opus 4.7 or GPT-5.5). Despite this, the model organically pushes back on certain requests. Like the cyber capabilities that make it useful for vulnerability hunting, its guardrails are emergent, and they sometimes cause it to push back on legitimate security research requests. But as we found, these organic refusals aren’t consistent: the same task, framed differently or presented in a different context, could produce completely different outcomes, as the examples below illustrate.

Example of Mythos Preview pushing back on building a working proof of concept

For example, the model initially refused to do vulnerability research on a project, then agreed to perform the same research on the same code after an unrelated change to the project’s environment. Nothing about the code being analyzed had changed. In another case, the model found and confirmed several serious memory bugs in a codebase, and then refused to write a demonstration exploit. The same request, framed differently, got a different answer. Even an identical request can produce different outcomes across runs because of the probabilistic nature of the model. Semantically equivalent tasks can produce opposite outcomes depending on how and when they’re presented to the model.

This matters because while the model’s organic refusals and guardrails are real, they aren’t consistent enough to serve as a complete safety boundary on their own. That’s precisely why any capable cyber frontier model made generally available in the future must include additional safeguards on top of this baseline behavior, making it appropriate for broader use outside of a controlled research context like Project Glasswing.
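A little arithmetic shows why an inconsistent refusal is a weak boundary. The sketch below assumes, purely for illustration, that each attempt is refused independently with the same probability; the 90% figure is a made-up number, not something we measured, and `chance_of_compliance` is a hypothetical helper. Since rephrasing and context shift the outcome, real attempts are not independent, so treat this as an intuition aid only.

```python
def chance_of_compliance(refusal_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries is not refused."""
    if not 0.0 <= refusal_rate <= 1.0:
        raise ValueError("refusal_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # All attempts are refused with probability refusal_rate ** attempts.
    return 1.0 - refusal_rate ** attempts


# Illustrative only: a hypothetical 90% per-attempt refusal rate.
for n in (1, 5, 20):
    print(n, round(chance_of_compliance(0.9, n), 3))
```

Under those assumptions, a request that is refused nine times out of ten still gets through about 41% of the time within five attempts and about 88% of the time within twenty.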

The signal-to-noise problem

One of the hardest parts of triaging security vulnerabilities is deciding which bugs are real, which are exploitable, and which need fixing now. This was a hard problem even in the pre-AI world. AI vulnerability scanners and AI-generated code have made it worse, and at Cloudflare we’ve built multiple post-validation stages to deal with it.
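The three triage questions map naturally onto a staged filter. The sketch below is a hypothetical illustration of that idea and not a description of our actual post-validation stages; the `Finding` fields, the `Verdict` names, and the use of exposure as the urgency signal are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    NOISE = "noise"          # could not be reproduced
    BACKLOG = "backlog"      # real, but no demonstrated exploit path
    SCHEDULED = "scheduled"  # exploitable, but not urgent
    FIX_NOW = "fix now"      # exploitable and exposed


@dataclass(frozen=True)
class Finding:
    title: str
    reproduced: bool   # is the bug real?
    exploitable: bool  # is there a working proof?
    exposed: bool      # assumed urgency signal, e.g. reachable from untrusted input


def triage(finding: Finding) -> Verdict:
    """Apply the three questions in order; each stage filters what reaches the next."""
    if not finding.reproduced:
        return Verdict.NOISE
    if not finding.exploitable:
        return Verdict.BACKLOG
    if not finding.exposed:
        return Verdict.SCHEDULED
    return Verdict.FIX_NOW
```

Ordering the checks this way means the cheapest question (does it reproduce?) discards noise before anyone spends time on exploitability.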

Two factors dominate the noise rate:

  • Programming language - C and C++ give you direct memory control and, with it, bug classes (buffer overflows, out-of-bounds reads and writes) that memory-safe languages like Rust rule out through compile-time checks and runtime bounds checking. We saw consistently more false positives from projects written in memory-unsafe languages.