CyberSecQwen-4B: Why Defensive Cyber Needs Small, Specialized, Locally-Runnable Models

Frontier models are very good at very many things. They are also expensive to call, ship every prompt off to someone else’s datacenter, and are explicitly trained to refuse the messy edge cases a real defender lives in: incident write-ups, attacker-grade payloads found in your own logs, vulnerability disclosure drafts.

Defensive cybersecurity is not a place where any of those tradeoffs are acceptable. Sensitive evidence stays internal. A SOC analyst triaging a leaked credential dump, a malware reverse-engineer dissecting a sample, a vulnerability researcher writing up a CVE — none of them should be pasting that content into a hosted API. The data itself can be the breach.

Per-call API cost compounds. A mid-size SOC processes thousands of low-confidence alerts per day, and hosted-API costs for “explain this CVE” or “what CWE applies here” turn defensive automation into a budget question. Air-gapped and partially-connected environments are the rule, not the exception, in critical infrastructure, healthcare, and government work. If your tooling can’t run on a laptop or a single on-prem GPU, it doesn’t ship there.
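
To see how per-call cost compounds, here is a back-of-envelope calculation; the alert volume, token count, and price below are illustrative assumptions, not measured figures:

```python
# Rough cost of routing SOC alert triage through a hosted API.
# All figures are illustrative assumptions, not measured numbers.

alerts_per_day = 5_000            # hypothetical mid-size SOC alert volume
tokens_per_alert = 1_500          # hypothetical prompt + completion size
price_per_million_tokens = 5.00   # hypothetical blended $/1M tokens

daily_cost = alerts_per_day * tokens_per_alert * price_per_million_tokens / 1_000_000
annual_cost = daily_cost * 365

print(f"daily:  ${daily_cost:,.2f}")    # $37.50/day under these assumptions
print(f"annual: ${annual_cost:,.2f}")   # $13,687.50/year, before retries or growth
```

Even at these modest assumptions the bill recurs forever; a one-time GPU purchase amortizes instead.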

Adversaries are getting more automated. Ransomware gangs use LLMs to draft phishing in 30 languages; bug-bounty automators chain agentic tools to fuzz, triage, and exploit faster than humans can review. Defense at the same speed needs models defenders own and can run. So: local matters. But “local” alone isn’t enough.

Why a small specialized model, not just a small model? A 70B generalist running locally on four GPUs is “local,” but it isn’t deployable. A 4B generalist running locally on a single consumer GPU is deployable, but it doesn’t beat the 8B specialist on the work you actually need it to do. The bet behind CyberSecQwen-4B is that for narrow, well-evaluated cyber threat intelligence tasks — CWE classification, CVE-to-CWE mapping, structured CTI Q&A — a careful 4B fine-tune can match or beat an 8B specialist while fitting on a 12 GB consumer card.
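
To make the CVE-to-CWE task concrete, a single item looks roughly like this. The field names are our own illustrative schema, not the actual dataset format; the pairing shown is the widely reported NVD mapping for Log4Shell:

```python
# Illustrative shape of one CVE-to-CWE mapping example.
# Schema is hypothetical; the label is the well-known NVD mapping for Log4Shell.

example = {
    "cve_id": "CVE-2021-44228",
    "description": (
        "Apache Log4j2 JNDI features used in configuration, log messages, "
        "and parameters do not protect against attacker-controlled LDAP "
        "and other JNDI-related endpoints."
    ),
    "label": "CWE-502",   # Deserialization of Untrusted Data
}

# The model sees only the description and must emit the terse CWE identifier.
prompt = f"Map this CVE description to its CWE:\n{example['description']}\nAnswer:"
print(example["label"])  # CWE-502
```

The narrowness of the output space (a single CWE identifier) is exactly what makes the task tractable for a 4B model.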

We tested this against the strongest public baseline we could find: Cisco’s Foundation-Sec-Instruct-8B, evaluated under their own published protocol on CTI-Bench.

Metric (CTI-Bench, n=5, temp 0.3)    CyberSecQwen-4B    Foundation-Sec-Instruct-8B    Δ
CTI-MCQ (2,500 items)                0.5868 ± 0.0029    0.4996                        +8.7 pp
CTI-RCM (1,000 CVE→CWE items)        0.6664 ± 0.0023    0.6850                        −1.9 pp
Parameters                           4 B                8 B                           half the size

CyberSecQwen-4B retains 97.3% of Foundation-Sec-Instruct-8B’s CTI-RCM accuracy while exceeding its CTI-MCQ score by 8.7 percentage points, at half the parameter count. That tradeoff is what should matter to a defender choosing what to deploy.
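
The n=5 figures read as mean ± sample standard deviation over five sampled runs at temperature 0.3. A minimal sketch of that aggregation, using placeholder per-run accuracies rather than our raw numbers:

```python
import statistics

def aggregate(run_accuracies):
    """Mean ± sample standard deviation over repeated benchmark runs."""
    mean = statistics.mean(run_accuracies)
    std = statistics.stdev(run_accuracies)  # sample std, n-1 denominator
    return mean, std

# Placeholder per-run CTI-MCQ accuracies for n=5 runs (not the actual raw data).
runs = [0.5832, 0.5860, 0.5872, 0.5896, 0.5880]
mean, std = aggregate(runs)
print(f"{mean:.4f} ± {std:.4f}")
```

Reporting the spread, not just the mean, is what makes the −1.9 pp CTI-RCM gap interpretable as a real but small deficit rather than sampling noise.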

Why AMD MI300X

The whole pipeline — training, adapter merging, evaluation — runs end-to-end on a single AMD Instinct MI300X 192 GB instance via the AMD Developer Cloud. The combination of 192 GB HBM3 and ROCm 7’s vLLM stack means we never had to think about quantization tricks, gradient checkpointing, or splitting the model across devices. Full bf16, FlashAttention-2 forward+backward, batch size 4, sequence length 4096 — all on a single GPU.
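
A back-of-envelope check shows why 192 GB makes this comfortable. The 16-bytes-per-parameter figure assumes bf16 weights and gradients plus fp32 AdamW moments and master weights, and ignores activation memory:

```python
# Rough VRAM budget for full-parameter bf16 training of a 4B model.
# Assumption: bf16 weights (2 B) + bf16 grads (2 B) + fp32 Adam m and v (8 B)
# + fp32 master weights (4 B) = 16 bytes/param. Activations excluded.

params = 4e9
bytes_per_param = 2 + 2 + 8 + 4
state_gb = params * bytes_per_param / 1e9
print(f"optimizer + weight state: ~{state_gb:.0f} GB of 192 GB HBM3")  # ~64 GB
```

Even with activations for batch 4 at 4096 tokens on top, the budget stays well inside a single MI300X, which is why no sharding or checkpointing tricks were needed.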

The training data

Two corpora, both clean to release under Apache-2.0:

  1. 2021 CVE → CWE mappings sourced from MITRE / NVD public records. Critically, all overlap with CTI-Bench’s evaluation set was deduplicated before training, so the benchmark numbers above are honest out-of-distribution holdouts, not contamination.
  2. Synthetic defensive-analyst Q&A grounded in the deduplicated CVE descriptions, generated with a stronger teacher model and Apache-2.0-licensed for redistribution.
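
The contamination guard in item 1 is simple to state: drop any training example whose CVE ID appears in the benchmark. A minimal sketch, where the function, field names, and CWE labels are illustrative rather than the actual pipeline:

```python
def dedup_against_benchmark(training_items, benchmark_cve_ids):
    """Remove training examples whose CVE ID appears in the eval set."""
    held_out = set(benchmark_cve_ids)
    return [item for item in training_items if item["cve_id"] not in held_out]

# Toy corpus; CWE labels here are illustrative.
train = [
    {"cve_id": "CVE-2021-44228", "label": "CWE-502"},
    {"cve_id": "CVE-2021-34527", "label": "CWE-269"},
]
bench_ids = ["CVE-2021-44228"]  # overlaps the benchmark, so it must be dropped

clean = dedup_against_benchmark(train, bench_ids)
print([item["cve_id"] for item in clean])  # ['CVE-2021-34527']
```

Filtering on the CVE ID rather than on description text keeps the check exact: near-duplicate descriptions of the same CVE can’t slip through on wording differences.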

The base model is Qwen3-4B-Instruct-2507, an Apache-2.0 instruction-tuned 4B that was the highest-performing 4B-class instruct model available at training time. We deliberately fine-tune on the instruct checkpoint, not the base: it preserves the terse-answer, multiple-choice format priors the instruction-tuning pass had already established, priors that a fine-tune from the base checkpoint would have had to rebuild from scratch.
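
One practical consequence of tuning the instruct checkpoint is that training examples can reuse its chat format and terse targets directly. A sketch of how an MCQ item might be rendered; the message schema here is the generic chat-messages convention, not the exact template the tokenizer ships:

```python
def to_chat_example(question, options, answer_letter):
    """Render a CTI-MCQ item as chat messages with a terse single-letter target."""
    opts = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    return [
        {"role": "user", "content": f"{question}\n{opts}\nAnswer with the letter only."},
        {"role": "assistant", "content": answer_letter},  # terse target, no prose
    ]

msgs = to_chat_example(
    "Which CWE best describes a SQL injection flaw?",
    {"A": "CWE-79", "B": "CWE-89", "C": "CWE-22", "D": "CWE-287"},
    "B",
)
print(msgs[1]["content"])  # B
```

Training on single-letter assistant turns like this reinforces, rather than fights, the format priors the instruct pass already installed, which is exactly the point of starting from that checkpoint.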