What Gemma 4 Actually Unlocks for a Local Security Swarm (And Why I Don't Use the Same Variant Everywhere)
This is a submission for the Gemma 4 Challenge: Write About Gemma 4. I’ve been building an offline, multi-tier adversarial agent swarm on a single workstation: an RTX 5070 (12GB VRAM), a Ryzen 9 9950X3D, zero cloud calls, zero external dependencies, and zero vendor content restrictions. The swarm acts as an autonomous “Blue Team”: it audits, scans, correlates threats, and, where appropriate, simulates the attacker side of an engagement against the assets it protects.
When the Gemma 4 family dropped, the question I had wasn’t “should I use it?” A local-first, capable, open-license, multimodal model with a 128K context window is an automatic yes. The genuinely interesting question was: which variant goes where? That’s the question I think most “I tried the new model” posts skip past. The Gemma 4 lineup isn’t just one model cut into three sizes. It’s three distinct architectural answers to three different deployment problems. Picking the right one per role is where you find real leverage.
The Lineup, Architecturally
For anyone who hasn’t pulled the spec sheet yet:
- Gemma 4 E2B / E4B: Small effective-parameter models built for the edge (phones, browsers, ambient compute). Fast time-to-first-token, tiny VRAM footprint, and you can run many of them concurrently.
- Gemma 4 26B MoE: Mixture-of-Experts. Total parameters are massive, but only a fraction activate per token. Designed for high throughput with strong reasoning on a per-task basis. It takes up space in memory, but it’s computationally much cheaper to run than its parameter count suggests.
- Gemma 4 31B Dense: Server-grade local. Every parameter fires on every token. Predictable inference cost and generally the strongest reasoning ceiling of the three, but it carries the highest VRAM tax and latency floor.
All three share the same training lineage, the same 128K context window, and the same multimodal head. They differ entirely on activation patterns, footprint, and what kind of work they are built to absorb.
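To put rough numbers on "footprint", here is a back-of-the-envelope sketch of how much memory the weights alone would need at different quantization levels. It assumes the nominal parameter counts implied by the model names (26B and 31B) and ignores the KV cache, activations, and runtime overhead, so real usage would be higher. The function name `weight_gib` is made up for this illustration.

```python
def weight_gib(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB.

    Assumption: every parameter is stored at the same bit width, and
    nothing else (KV cache, activations, runtime buffers) is counted.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


VRAM_GIB = 12  # the single RTX 5070 described in the post

if __name__ == "__main__":
    # Nominal sizes taken from the model names; treat them as approximations.
    for name, params in [("26B MoE", 26), ("31B Dense", 31)]:
        for bits in (16, 8, 4):
            size = weight_gib(params, bits)
            verdict = "fits in" if size < VRAM_GIB else "exceeds"
            print(f"{name} @ {bits}-bit: {size:5.1f} GiB ({verdict} {VRAM_GIB} GiB)")
```

Under these assumptions, the 31B weights come to roughly 14.4 GiB even at 4 bits per weight, which is more than a 12GB card holds. The same arithmetic applies to the MoE: sparse activation lowers the compute per token, not the amount of weight data that has to be stored somewhere.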
Casting Models by RBAC Tier
The swarm uses a 6-tier zero-trust Role-Based Access Control (RBAC) system. Tier 0 is the most privileged: supervisors that can spawn, terminate, and de-escalate other agents. Tier 5 is the least privileged: ambient scanners that watch logs, file changes, and network deltas. Every privileged action routes through a hardcoded PermissionGate that doesn’t care what the model wants; if the tier doesn’t permit it, the call dies.
This matters for model casting because higher tiers don’t just need smarter agents; they need slower, more deliberate ones. A supervisor that fires off twenty execution plans a second is a massive liability. Conversely, an ambient scanner that thinks for three seconds before flagging a file change is useless. So, the question per tier is: how much reasoning depth, how much latency tolerance, and how many instances do we need concurrently?
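The PermissionGate idea can be illustrated with a minimal sketch. This is not the swarm's actual gate: the action names, the tier thresholds in `REQUIRED_TIER`, and the `Agent` type are all hypothetical. It also assumes tiers are numbered 0 (most privileged) to 5 (least privileged). The point it shows is that the check is a plain table lookup made outside the model, so nothing the model generates can change the outcome.

```python
from dataclasses import dataclass

# Hypothetical actions and the least-privileged tier allowed to perform them.
# Assumption: a lower tier number means more privilege (0 = top, 5 = bottom).
REQUIRED_TIER = {
    "spawn_agent": 1,
    "terminate_agent": 1,
    "correlate_alerts": 3,
    "read_logs": 5,
}


class PermissionDenied(Exception):
    """Raised when an agent's tier does not permit the requested action."""


@dataclass(frozen=True)
class Agent:
    name: str
    tier: int


class PermissionGate:
    def check(self, agent: Agent, action: str) -> None:
        required = REQUIRED_TIER.get(action)
        # Unknown actions are denied by default; so is any agent whose
        # tier number is higher (less privileged) than the action requires.
        if required is None or agent.tier > required:
            raise PermissionDenied(
                f"{agent.name} (tier {agent.tier}) may not {action}"
            )


if __name__ == "__main__":
    gate = PermissionGate()
    gate.check(Agent("supervisor", 0), "spawn_agent")  # allowed, returns None
    try:
        gate.check(Agent("scanner", 5), "spawn_agent")
    except PermissionDenied as err:
        print("denied:", err)
```

Denying unknown actions by default is a design choice made for this sketch; it keeps a newly added capability from being reachable until someone assigns it a tier.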
Where Each Variant Earns Its Slot
- The E2B / E4B at the edges (Tiers 4–5). Ambient watchers, log diffing, simple anomaly flagging, and “is this string weird” classification. The work here is high-volume, mostly pattern-shaped, and low stakes per call. I need several of these running concurrently with zero VRAM drama. A small model that returns a token in tens of milliseconds and lets me run multiples in parallel easily beats a 31B Dense that locks the GPU for seconds. Edge Gemma 4 is built for exactly this shape of work.
- The 26B MoE in the middle (Tiers 2–3). Triage, correlation, and threat synthesis (“you’ve got fifteen of these alerts; is an attack chain forming?”). The MoE architecture fits here for a specific reason: middle-tier work is bursty. You have quiet stretches followed by a sudden need to reason hard about a correlated set of events. MoE’s sparse activation means we get 31B-class reasoning without the relentless compute tax of a dense model. The 128K context window pays for itself here too, allowing triage agents to ingest a long correlation window of events in a single shot.
- The 31B Dense at the top (Tiers 0–1), with caveats. Supervisors, planners, and adversarial scenario generation. Dense earns its slot here because top-tier reasoning needs to be predictable. When an MoE routes to a different expert mix on a similar query, the depth of reasoning can occasionally vary between otherwise similar calls. For a supervisor agent deciding whether to spawn a sub-agent at a different privilege tier, I want every parameter applied to every token, every time, more than I want peak throughput. Dense delivers that.
- The Caveat: On a single-card 12GB 5070, a 31B Dense model is the heavyweight in the room. It cannot coexist with the MoE and a stack of edge models without aggressive quantization and careful orchestration. Mine gets gated through an HTTP inference queue: agents request inference, the gateway serializes the high-cost calls, and the small models keep running in their own lane. It’s not glamorous infrastructure, but it’s what makes the casting work.
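The queueing behavior in the caveat can be sketched in a few lines. This is an in-process approximation using `asyncio`, not the HTTP gateway itself: the class name `InferenceGateway`, the `max_light` cap, and the `_run_model` placeholder are assumptions made for illustration. Heavy calls take an exclusive lock so only one runs at a time, while light calls share a separate semaphore and can overlap.

```python
import asyncio


class InferenceGateway:
    """Serializes heavy-model calls; lets light-model calls overlap up to a cap."""

    def __init__(self, max_light: int = 4) -> None:
        self._heavy = asyncio.Lock()               # one heavy call at a time
        self._light = asyncio.Semaphore(max_light)  # separate lane for small models

    async def infer(self, model: str, prompt: str, heavy: bool) -> str:
        guard = self._heavy if heavy else self._light
        async with guard:
            return await self._run_model(model, prompt)

    async def _run_model(self, model: str, prompt: str) -> str:
        # Placeholder for the real backend call (hypothetical); a real
        # gateway would forward the request to the inference server here.
        await asyncio.sleep(0.01)
        return f"[{model}] {prompt}"


async def main() -> None:
    gateway = InferenceGateway(max_light=4)  # created inside the running loop
    results = await asyncio.gather(
        gateway.infer("dense", "plan next step", heavy=True),
        gateway.infer("dense", "review spawn request", heavy=True),
        gateway.infer("edge", "is this log line odd?", heavy=False),
        gateway.infer("edge", "is this path odd?", heavy=False),
    )
    for line in results:
        print(line)


if __name__ == "__main__":
    asyncio.run(main())
```

The two lanes here only control admission. On a single GPU the heavy and light models would still compete for the same device, so the lane separation keeps small requests from waiting in the heavy queue but does not by itself guarantee their latency.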
What I Actively Avoid
Based on this architecture, here are a few patterns I actively avoid:
- Don’t use the 31B Dense everywhere just because it’s the strongest. Latency at the bottom tier kills a swarm’s situational awareness. You’ll miss live events because your “ambient” watchers are blocked behind a heavy inference floor.
- Don’t put the MoE on supervisor duty. I like the model. I just don’t want expert routing that can shift between similar inputs inside the agent that decides whether another agent gets disk-write permissions.
- Don’t put the E2B/E4B on triage. Edge models are great at answering “is this weird?”, but weak at “what does it mean across these fifteen events?” Triage is the rung where context and parameter count win, not throughput.
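The casting and the three anti-patterns can be written down as a small lookup that a configuration check might run before agents start. This is a sketch only: `Variant`, `CASTING`, `ANTI_PATTERNS`, and `check_casting` are hypothetical names, and the tier numbering (0 most privileged, 5 least, with tiers 0 and 1 as the top band) is an assumption about how a six-tier scheme is laid out.

```python
from enum import Enum
from typing import Optional


class Variant(Enum):
    EDGE = "E2B/E4B"
    MOE = "26B MoE"
    DENSE = "31B Dense"


# Assumed tier layout: 0-1 supervisors, 2-3 triage, 4-5 ambient watchers.
CASTING = {
    0: Variant.DENSE, 1: Variant.DENSE,
    2: Variant.MOE, 3: Variant.MOE,
    4: Variant.EDGE, 5: Variant.EDGE,
}

# (expected variant, proposed variant) -> reason paraphrased from the list above.
ANTI_PATTERNS = {
    (Variant.EDGE, Variant.DENSE): "latency at the bottom tier blocks ambient watchers",
    (Variant.DENSE, Variant.MOE): "expert routing can shift inside a supervisor",
    (Variant.MOE, Variant.EDGE): "triage needs context and parameter count",
}


def check_casting(tier: int, proposed: Variant) -> Optional[str]:
    """Return a warning if the pairing departs from the casting, else None."""
    expected = CASTING[tier]  # raises KeyError for tiers outside 0-5
    if proposed is expected:
        return None
    reason = ANTI_PATTERNS.get((expected, proposed), "not the casting described here")
    return f"tier {tier}: {proposed.value} instead of {expected.value} ({reason})"


if __name__ == "__main__":
    for tier, variant in [(5, Variant.EDGE), (5, Variant.DENSE), (0, Variant.MOE)]:
        print(check_casting(tier, variant) or f"tier {tier}: {variant.value} ok")
```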
The Takeaway
The Gemma 4 release is remarkable because the variants are legitimately different tools, not just three sizes of the same hammer. The MoE isn’t “the 31B but smaller,” and the E2B isn’t “the E4B but worse.” Each one is shaped for a specific class of work. For a local-first, zero…