A 10 year old Xeon is all you need

The previous post covered getting Gemma 4's MTP drafters quantized and paired with a verifier. This one is about running the result on a machine that has no business running it.

I have a recycled server. To its credit, it has a whopping 128 GB of RAM, but it's DDR3… That RAM is roughly 5-6 times slower than the best current laptop RAM. It also has a single Intel Xeon E5-2620 v4 from 2016, which is about 5 times slower than my laptop's CPU… Oh, and did I mention that we have no GPU? And no, the Xeon does not have an integrated GPU.

But just hear me out… If we were to just break out ollama here, well… as explained in earlier blog posts, we can't. We'd be lucky if we could in 6 months, when they add support for the model we need, if they ever do. And even then, ollama simply doesn't expose enough knobs for us to ever make this run well, and neither does standard llama.cpp.

But why would that stop us?

I've received feedback that some of the previous posts were too high level, so I'll try to make things as clear as reasonably possible here. If you're a tech worker, or a Linux enthusiast who has built a computer and used something like ChatGPT, most of this should be approachable.

So, just to set the stage properly, here is the hardware, per lscpu:

CPU: Intel Xeon E5-2620 v4 @ 2.10 GHz
Cores: 8 physical, 16 threads
Instruction sets: AVX2 (no AVX-512, no AVX-VNNI, no BF16)
Cache: 20 MiB L3, 2 MiB L2 total
Memory: 128 GB DDR3
GPU: none
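If you want to check the instruction-set line on your own box without squinting at lscpu, here is a minimal Python sketch. It assumes an x86 Linux machine, where /proc/cpuinfo has a "flags" line; the helper names (cpu_flags, report) are mine, and the flag spellings are the ones the Linux kernel uses.

```python
from pathlib import Path
from typing import Dict, Set, Tuple

# Flag names as they appear on the "flags" line of /proc/cpuinfo (x86 Linux).
WANTED: Tuple[str, ...] = ("avx2", "avx512f", "avx_vnni", "avx512_bf16")


def cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the feature flags of the first CPU listed in the text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def report(cpuinfo_text: str, wanted: Tuple[str, ...] = WANTED) -> Dict[str, bool]:
    """Map each wanted flag to whether the CPU advertises it."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        for name, present in report(path.read_text()).items():
            print(f"{name:12s} {'yes' if present else 'no'}")
```

On a machine like this one you would expect only avx2 to come back as present.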

For LLM inference, memory bandwidth is the limiting resource. Every token generated requires hauling gigabytes of weights from RAM into the CPU cache.

When you use a tool like ChatGPT and watch the text stream onto your screen word by word, you are watching the "decoder pass". During this phase, the model generates the output one piece (or "token") at a time.

In this step, the system's raw processing power is rarely the bottleneck. Instead, the limitation is memory bandwidth. To calculate the next token, the processor has to constantly pull massive amounts of data from memory into the compute cores. That data is the "weights" that contain the model's learned knowledge.

The processor executes the required matrix calculations so quickly that it is left sitting idle, waiting for the hardware to physically move the next chunk of weights across the memory bus. In traditional software terms, decoding is heavily memory-bound, not compute-bound.

This is the so-called "memory wall", one of the biggest performance hurdles right now, whether you're on a Xeon or an H100.
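To put a rough number on the memory wall, here is a back-of-the-envelope sketch. Everything in it is an assumption rather than a measurement: 51.2 GB/s is the theoretical peak of four channels of DDR3-1600 (I have not benchmarked this box, and real sustained bandwidth is lower), 3.8B is the approximate number of parameters Gemma 4 26B-A4B touches per token, and 8.5 bits per weight is what Q8_0 costs once you count the per-block scale. It also ignores the KV cache and everything else that has to move.

```python
def decode_ceiling_tokens_per_s(active_params: float,
                                bits_per_weight: float,
                                bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Assumed figures, see the paragraph above; swap in your own measurements.
    ceiling = decode_ceiling_tokens_per_s(
        active_params=3.8e9,    # active parameters per token
        bits_per_weight=8.5,    # Q8_0: 32 int8 weights + one fp16 scale per block
        bandwidth_gb_s=51.2,    # theoretical peak, 4 x DDR3-1600 (assumption)
    )
    print(f"decode ceiling: {ceiling:.1f} tokens/s")
```

The point is not the exact figure. The ceiling is set by bytes moved per token, so the only ways up are to move fewer bytes or to get more tokens out of each pass over the weights.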

Naively running llama-cli on a DDR3 machine without a GPU is horrendously slow, if it runs at all, because the defaults are tuned for a generic GPU use case and leave a lot of improvements on the table. Further, it simply doesn't have most of the optimizations that the state of the art currently uses to run these models at scale.

The remedy is to pull every optimization lever ik_llama.cpp exposes. Most of them are slightly obscure.

Here is the magic spell that makes this actually run.

llama-cli \
  --model gemma-4-26B-A4B-it-Q8_0.gguf \
  --model-draft gemma-4-26B-A4B-it-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf \
  --spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune \
  -cnv --color --jinja --special \
  -sm graph -smgs -sas -mea 256 --split-mode-f32 \
  --temp 0.7 -t 8 --parallel 8 \
  --cpu-moe --merge-up-gate-experts \
  --flash-attn on --mla-use 3 \
  --mlock --run-time-repack --no-kv-offload

Under a black-box tool like ollama you never see this line. On aging hardware you have to understand what each flag does, because half of them won't take, and the engine will only tell you so in passing.

Speculative decoding

--spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune

This pairs the 26B verifier with the small drafter from the previous post. Each draft is up to three tokens long (--draft-max 3), no drafted token is dropped for low probability (--draft-p-min 0.0), and --spec-autotune adjusts the chain length per workload.
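To make the draft-and-verify idea concrete, here is a toy greedy version of one speculative step in Python. This is a sketch of the general technique, not the engine's code: speculative_step, draft_fn and verify_fn are hypothetical names, the "models" are plain functions over integer tokens, and I am assuming the p-min setting acts as a cutoff on the drafter's own confidence. A real engine scores all drafted positions in one batched verifier pass; the loop below only mimics the accept/reject logic.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
DraftFn = Callable[[Sequence[Token]], Tuple[Token, float]]  # -> (token, confidence)
VerifyFn = Callable[[Sequence[Token]], Token]               # -> verifier's next token


def speculative_step(context: Sequence[Token], draft_fn: DraftFn, verify_fn: VerifyFn,
                     draft_max: int = 3, p_min: float = 0.0) -> List[Token]:
    """Return the tokens emitted by one draft-then-verify round (greedy)."""
    # 1. The cheap drafter proposes up to draft_max tokens.
    drafted: List[Token] = []
    ctx = list(context)
    for _ in range(draft_max):
        token, prob = draft_fn(ctx)
        if prob < p_min:          # assumed meaning of the p-min cutoff
            break
        drafted.append(token)
        ctx.append(token)

    # 2. The verifier checks the draft; keep the longest agreeing prefix.
    emitted: List[Token] = []
    ctx = list(context)
    for token in drafted:
        target = verify_fn(ctx)
        if target != token:
            emitted.append(target)   # first mismatch: take the verifier's token
            return emitted
        emitted.append(token)
        ctx.append(token)

    # 3. Every drafted token matched, so the verifier's next token comes for free.
    emitted.append(verify_fn(ctx))
    return emitted


# Toy models: the "right" next token is always last + 1.
def toy_verifier(ctx: Sequence[Token]) -> Token:
    return ctx[-1] + 1


def toy_drafter(ctx: Sequence[Token]) -> Tuple[Token, float]:
    # Agrees with the verifier except right after token 2.
    return (99, 1.0) if ctx[-1] == 2 else (ctx[-1] + 1, 1.0)


if __name__ == "__main__":
    print(speculative_step([0], toy_drafter, toy_verifier))
```

Even when the drafter gets the third token wrong, the round still emits three correct tokens for one verifier pass, and the output is exactly what the verifier alone would have produced.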

This ties directly back to the earlier discussion of the memory-bound decoder pass.

When a model uses a long reasoning chain, it is generating those "thinking" tokens one by one. Even if the internal reasoning is hidden from the user and all you see is a short final answer, the hardware still has to perform a full decoder pass for every single token in that hidden chain.

Speculative decoding is currently one of the most effective software workarounds the AI industry has come up with for the "memory wall": the verifier checks several drafted tokens in a single pass over its weights, so one expensive trip through memory can yield more than one token. --spec-autotune is how you squeeze the maximum speed out of it.

The argument for speculative decoding is stronger on CPU than on GPU. CPU compute is cheap relative to the cost of streaming the verifier's weights through cache, so spending extra cycles on a tiny drafter whose active layers easily fit in L3 buys tokens at very little marginal cost. The verifier, however, spills out of every cache level.
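The "very little marginal cost" argument can be put into numbers with the usual textbook approximation for speculative decoding. This is a sketch under stated assumptions, not a measurement: alpha is the chance that any one drafted token is accepted (treated as independent per token), k is the draft length, and draft_cost is a hypothetical ratio of one drafter step to one verifier pass.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verifier pass with draft length k."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Speedup over plain decoding when each drafter step costs draft_cost verifier passes."""
    return expected_tokens_per_pass(alpha, k) / (1.0 + k * draft_cost)


if __name__ == "__main__":
    # Illustrative values only; measure alpha on your own workload.
    for alpha in (0.5, 0.7, 0.9):
        gain = estimated_speedup(alpha, k=3, draft_cost=0.05)
        print(f"alpha={alpha:.1f}  k=3  speedup ~ {gain:.2f}x")
```

The cheaper the drafter is relative to the verifier, the closer the speedup gets to the raw tokens-per-pass figure, which is why a drafter that lives in L3 is such a good fit here.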

CPU and MoE routing

--cpu-moe --merge-up-gate-experts -t 8 --parallel 8

Gemma 4 26B-A4B has 128 experts with 8 active per token, giving about 3.8B active parameters out of ~25.2B total. --cpu-moe keeps the MoE expert weights in CPU memory rather than offloading them to a GPU. It does not change which experts get picked; that is decided by the model's router. On a box with no GPU it mostly makes the placement explicit.

CPUs handle memory very differently than GPUs. While a GPU has a massive pool of ultra-fast High-Bandwidth Memory (HBM), a CPU relies on small, lightning-fast "caches" (L1, L2, L3) built directly onto the processor chip.

In an MoE model, constantly jumping around between 128 different experts can cause "cache thrashing", where the CPU constantly has to dump its cache and fetch new weights from the much slower main system RAM (normally DDR4/DDR5, we're on DDR3!).

Since the engine cannot pick different experts to suit the cache, the wins have to come from how the expert weights are laid out and traversed, so that whatever is reused stays inside the CPU's local cache for as long as possible.
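For readers who have not seen MoE routing before, here is a minimal Python sketch of generic top-k gating with 128 experts and 8 active. It illustrates the general technique only; I am not claiming this is Gemma's exact router, and route_top_k is a name I made up for the example.

```python
import math
import random
from typing import List, Sequence, Tuple


def route_top_k(router_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax their scores into mixing weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)            # for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


if __name__ == "__main__":
    random.seed(0)
    n_experts, k = 128, 8
    # Stand-in for the router's output on one token (random, for illustration).
    logits = [random.gauss(0.0, 1.0) for _ in range(n_experts)]
    chosen = route_top_k(logits, k)
    print("experts used:", sorted(i for i, _ in chosen))
    print(f"share of experts touched: {k / n_experts:.1%}")
```

Each token only touches a small share of the expert weights, but a different token can pick a completely different set, which is exactly why the cache has such a hard time keeping up.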

--merge-up-gate-experts fuses two per-expert projections into a single