How to setup a local coding agent on macOS
How to setup a local coding agent on macOS
如何在 macOS 上搭建本地编程助手
I’d had my internet fail a few times recently leaving me stranded without a coding agent, and so when I saw the “Gemma 4 now runs 2x faster with MTP” Multi-Token Prediction update for Gemma 4 I decided to have a go at getting it running. 最近我的网络几次中断,导致我无法使用编程助手。因此,当我看到 Gemma 4 的“MTP(多标记预测)更新使其运行速度提升 2 倍”的消息时,我决定尝试将其运行起来。
I wanted a local coding agent setup that: was fast enough to actually use on my Mac, worked through an OpenAI compatible API (so I could use it in other tools), and preferably could handle screenshots/images when needed, so I can feed it screenshots of what it has made. And I did! This video is realtime. And shows the agent responding at a perfectly usable speed. 我想要一个满足以下条件的本地编程助手:在 Mac 上运行速度足够快、通过兼容 OpenAI 的 API 工作(以便在其他工具中使用)、并且最好能在需要时处理截图/图像,这样我就可以把它的生成结果截图发给它。我做到了!视频是实时录制的,展示了助手以非常理想的速度进行响应。
After a bit of testing the final setup I ended up with is: 经过一番测试,我最终的配置如下:
- llama.cpp built with Metal on macOS
- macOS 上使用 Metal 构建的 llama.cpp
- Gemma 4 26B-A4B in GGUF format
- GGUF 格式的 Gemma 4 26B-A4B
- A Q8 MTP draft model for speculative decoding
- 用于推测解码的 Q8 MTP 草稿模型
- The Gemma 4 multimodal projector
- Gemma 4 多模态投影仪
- Pi as the terminal coding agent
- 作为终端编程助手的 Pi
This was tested on an Apple M1 Max with 64 GB unified memory, running macOS 15.7.7. 此配置在配备 64GB 统一内存的 Apple M1 Max 上进行了测试,运行系统为 macOS 15.7.7。
The Model / 模型
The main model is: gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf.
主模型为:gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf。
Link on Huggingface: models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
Huggingface 链接:models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
That file is about 16 GB. With the MTP draft head and multimodal projector the model folder is about 17 GB. 该文件约为 16 GB。加上 MTP 草稿头和多模态投影仪,模型文件夹总大小约为 17 GB。
The benchmark prompt was: “Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.” Each benchmark generated about 128 tokens. 基准测试提示词为:“编写一个简洁的 Python 函数来解析统一差异(unified diff)并返回更改的文件路径。然后解释两个边缘情况。”每次基准测试生成约 128 个 token。
Baseline: llama.cpp + Metal / 基准:llama.cpp + Metal
First I ran the main model directly through llama.cpp with Metal acceleration: 首先,我通过带有 Metal 加速的 llama.cpp 直接运行了主模型:
repos/llama.cpp/build/bin/llama-cli \
-m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
-ngl 999 \
-fa on \
-c 4096 \
-n 128
Result:
| Setup | Prompt tok/s | Generation tok/s |
|---|---|---|
| Gemma 4 26B-A4B Q4, llama.cpp Metal | 298.0 | 58.2 |
58 tokens/second is not fast, but is usable, but for coding-agent work you want it to be as fast as possible, especially when the agent is making many tool calls. 每秒 58 个 token 虽然不算快,但可以使用。不过对于编程助手来说,速度越快越好,尤其是在助手需要进行大量工具调用时。
Adding the MTP Draft Model / 添加 MTP 草稿模型
Gemma 4 now has the MTP draft model available: MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf. This can be loaded by llama.cpp as a speculative draft model:
Gemma 4 现在提供了 MTP 草稿模型:MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf。它可以作为推测草稿模型由 llama.cpp 加载:
repos/llama.cpp/build/bin/llama-cli \
-m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
--model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
-ngl 999 \
-fa on \
-c 4096 \
-n 128
The first run with MTP came in at 69.2 tokens/second using 4 draft tokens. However, Unsloth’s guide on How to Run MTP Models includes this note: “We found —spec-draft-n-max 2 is the best starting point however, do not assume 2 is optimal, as performance is hardware-dependent. Try any value from 1 through 6 and use whichever is fastest for your system.”
首次运行 MTP 时,使用 4 个草稿 token 达到了每秒 69.2 个 token。然而,Unsloth 关于如何运行 MTP 模型的指南中提到:“我们发现 --spec-draft-n-max 2 是最佳起点,但不要假设 2 就是最优值,因为性能取决于硬件。尝试 1 到 6 之间的任何值,并使用对你系统最快的一个。”
After sweeping --spec-draft-n-max, the best result was 72.2 tokens/second with 3 draft tokens.
在遍历 --spec-draft-n-max 后,最佳结果是使用 3 个草稿 token,达到每秒 72.2 个 token。
| Setup | Prompt tok/s | Generation tok/s | Speedup |
|---|---|---|---|
| Main model only | 298.0 | 58.2 | 1.00x |
| Main model + Q8 MTP draft | 295.6 | 72.2 | 1.24x |
The useful part is that prompt processing stayed basically the same, while generation improved by about 24%. 值得一提的是,提示词处理速度基本保持不变,而生成速度提升了约 24%。
Tuning MTP / 调整 MTP
I tested --spec-draft-n-max values from 1 to 6. On my M1 Max machine, 3 was the fastest, with 2 close enough that either would be fine. Values above that got slower.
我测试了 1 到 6 的 --spec-draft-n-max 值。在我的 M1 Max 机器上,3 是最快的,2 的表现也很接近,两者皆可。超过该值后速度反而变慢了。
MLX Comparison / MLX 对比
I also tested MLX models through mlx-lm, to find out which is the faster way to run the model on a Mac, llama.cpp or mlx.
我还通过 mlx-lm 测试了 MLX 模型,以找出在 Mac 上运行模型更快的方案:llama.cpp 还是 MLX。
| Runtime | Model | Generation tok/s |
|---|---|---|
| llama.cpp Metal + MTP | Unsloth GGUF Q4 + Q8 MTP | 72.2 |
| llama.cpp Metal | Unsloth GGUF Q4 | 58.2 |
| MLX-LM | Unsloth UD MLX 4-bit | 45.8 |
| MLX-LM | mlx-community 4-bit | 43.9 |
| MLX-LM | mlx-community OptiQ 4-bit | 38.1 |
I thought MLX (being optimised for the Mac) would be fastest. However, for this specific setup, llama.cpp was faster than MLX, and llama.cpp with MTP was clearly the best option. 我原以为 MLX(针对 Mac 优化)会是最快的。然而,对于这个特定配置,llama.cpp 比 MLX 更快,且带有 MTP 的 llama.cpp 显然是最佳选择。
Adding Image Support / 添加图像支持
For Pi, I also wanted to be able to attach screenshots. The local model entry I setup for it originally declared the model as text-only: "input": ["text"]. That meant Pi did not send image tool output through to the model properly.
对于 Pi,我还希望能够附加截图。我最初为它设置的本地模型条目声明该模型仅支持文本:"input": ["text"]。这意味着 Pi 无法将图像工具的输出正确发送给模型。
The llama.cpp server also needs the Gemma 4 multimodal projector in order for the multi-modal part to work: mmproj-BF16.gguf. When loaded with --mmproj, llama.cpp advertises multimodal support, and Pi can send images.
llama.cpp 服务器还需要 Gemma 4 多模态投影仪才能使多模态功能正常工作:mmproj-BF16.gguf。当使用 --mmproj 加载时,llama.cpp 会声明支持多模态,Pi 就可以发送图像了。
I re-ran the text benchmark with the projector loaded, just to check it didn’t change the speed: 我重新运行了加载投影仪后的文本基准测试,以确保它不会影响速度:
| Setup | Projector | Prompt tok/s | Generation tok/s |
|---|---|---|---|
| llama.cpp Metal + MTP | none | 120.3 | 71.4 |
| llama.cpp Metal + MTP | mmproj-BF16.gguf | 297.4 | 72.2 |
The final run with the projector did not show a text-generation slowdown. 最终测试显示,加载投影仪后文本生成速度并未下降。
Setup Instructions / 设置指南
Install llama.cpp / 安装 llama.cpp
brew install cmake git tmux python@3.11
mkdir -p ~/Developer/ML-Models/Gemma4/repos
cd ~/Developer/ML-Models/Gemma4
git clone https://github.com/ggml-org/llama.cpp repos/llama.cpp
cd repos/llama.cpp
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_METAL=ON \
-DGGML_ACCELERATE=ON
cmake --build build --config Release -j
Download the Model Files / 下载模型文件
cd ~/Developer/ML-Models/Gemma4
python3.11 -m venv .venv
source .venv/bin/activate
pip install -U huggingface_hub hf_xet
mkdir -p models/unsloth-gemma-4-26B-A4B-it-GGUF
huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
mmproj-BF16.gguf \
MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf