Running local models on an M4 with 24GB memory
I’ve been experimenting with running local models on and off for a bit, and I’ve finally found a setup that works reasonably well. It’s nothing like the output of a SOTA model, but the excitement of having a local model do basic tasks, research, and planning more than makes up for it! No internet connection required! Not to mention that it’s a way of reducing your dependence on big US tech, even if just a tiny bit.
I gotta say though, it’s not easy to get this stuff set up. First you have to choose how you’re running the model: Ollama, llama.cpp, or LM Studio. Each one comes with its own quirks and limitations, and they don’t all offer the same models. Then of course, you have to pick your model. You want the best model available that fits in memory while still leaving enough headroom to run your regular assortment of Electron apps, and that can handle at least a 64K context window, ideally 128K or more.
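As a rough sanity check on what actually fits, this is the kind of back-of-the-envelope arithmetic I mean. A minimal sketch, where the parameter count, bits per weight, and per-token KV-cache size are illustrative assumptions rather than measurements of any particular model:

```python
# Rough, illustrative memory estimate: quantized weights plus KV cache.
# All the concrete numbers below are assumptions for the arithmetic, not measured values.

def estimate_gb(params_billion: float, bits_per_weight: float,
                context_tokens: int, kv_bytes_per_token: float) -> float:
    """Very rough total footprint in GB for weights + KV cache."""
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    kv_cache_gb = context_tokens * kv_bytes_per_token / 1e9
    return weights_gb + kv_cache_gb

# Hypothetical ~9B model at ~4.5 bits/weight (a q4_k_s-style quant),
# assuming ~100 KB of KV cache per token at full precision:
print(estimate_gb(9, 4.5, 128_000, 100_000))  # ~17.9 GB -- tight on a 24GB machine
# Quantizing the KV cache (e.g. to 8-bit) roughly halves that second term:
print(estimate_gb(9, 4.5, 128_000, 50_000))   # ~11.5 GB -- room left for Electron apps
```

The exact numbers don’t matter much; the point is that at 128K context the KV cache can easily dwarf the weights, which is why the context window and cache quantization settings matter as much as the model size.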
Most recently I’ve tried Qwen 3.6 Q3, GPT-OSS 20B, and Devstral Small 24B, which all technically fit in memory but were in practice unusable, and Gemma 4B, which ran fine but really struggled with tool use. Then there’s a plethora of configuration options to tweak, from the more well-known ones like temperature to more esoteric options like K Cache Quantization Type. Many of these tools come with a basic recommended set of options, but the appropriate ones can depend on things like whether you’re enabling thinking or not!
Qwen 3.5-9B (4-bit quant) qwen3.5-9b@q4_k_s (HuggingFace link) is the best model I’ve gotten working: a reasonable ~40 tokens per second, thinking enabled, successful tool use, and a 128K context window, running on LM Studio. Compared to a SOTA model, it gets distracted more easily, sometimes it gets stuck in loops, it’ll misinterpret asks, etc. But it’s surprisingly good for something that can run on a 24GB MacBook Pro while leaving space for lots of other things running too!
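To check tool use, a minimal smoke test along these lines works against LM Studio’s local OpenAI-compatible server (it defaults to localhost:1234); the base URL, the model identifier, and the get_weather tool are assumptions for illustration:

```python
import json
import requests

# Minimal tool-use smoke test against LM Studio's OpenAI-compatible local server.
# BASE_URL, MODEL, and the get_weather tool are illustrative assumptions.
BASE_URL = "http://localhost:1234/v1"
MODEL = "qwen3.5-9b@q4_k_s"  # whatever identifier LM Studio shows for the loaded model

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, just to see whether the model calls it
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "What's the weather in Lisbon?"}],
        "tools": tools,
    },
    timeout=120,
)
resp.raise_for_status()
message = resp.json()["choices"][0]["message"]
# A model that handles tool calling should come back with a tool_calls entry
# rather than a plain-text guess at the weather.
print(json.dumps(message.get("tool_calls"), indent=2))
```

If a model struggles with tool use, this is usually where it shows: it answers in plain text instead of emitting a tool_calls entry.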
These are the recommended settings for thinking mode on precise coding tasks (e.g., WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0. To enable thinking I also had to select the model, go to its configuration, scroll to the bottom of the Inference tab, and add {%- set enable_thinking = true %} to the Prompt Template.
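If you’d rather pass these settings per request instead of setting them in the LM Studio UI, a sketch like the following should work against the local server. One caveat: temperature and top_p are standard OpenAI-compatible fields, while top_k, min_p, and repetition_penalty are extensions, so whether the server honors them in the request body (rather than only via the UI) is an assumption here:

```python
import requests

# Sketch: send the recommended thinking-mode sampling settings per request.
# top_k, min_p, and repetition_penalty are non-standard fields; whether the
# local server accepts them in the body is an assumption -- set them in the
# LM Studio UI if it does not.
payload = {
    "model": "qwen3.5-9b@q4_k_s",
    "messages": [
        {"role": "user", "content": "Refactor this function to avoid the nested loops."}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "repetition_penalty": 1.0,
}

resp = requests.post("http://localhost:1234/v1/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```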
I’ve been using it through both pi and OpenCode. I still haven’t quite made up my mind on which one I prefer. Pi feels a bit snappier, but although I really appreciate the idea of the harness building itself and all that customization, I can’t help but wish it came with some sensible defaults. I feel like you could easily end up spending more time tweaking your pi setup to be just right than you do on your actual projects!
(Note: The article continues with technical configuration snippets for pi and OpenCode, followed by a discussion on the workflow differences between local models and SOTA models, and a practical example of using the model for code linting.)