DeepSeek 4 Flash local inference engine for Metal
ds4.c is a small native inference engine for DeepSeek V4 Flash. It is intentionally narrow: not a generic GGUF runner, not a wrapper around another runtime, and not a framework. The main path is a DeepSeek V4 Flash-specific Metal graph executor with DS4-specific loading, prompt rendering, KV state, and server API glue. This project would not exist without llama.cpp and GGML: make sure to read the acknowledgements section. A big thank you to Georgi Gerganov and all the other contributors.
Now, back to this project. Why do we believe DeepSeek v4 Flash is a pretty special model, deserving a standalone engine? Because after comparing it with powerful smaller dense models, we can report that: DeepSeek v4 Flash is faster because it has fewer active parameters. In thinking mode, if you avoid max thinking, it produces a thinking section that is a lot shorter than other models', often as little as 1/5 of theirs, and crucially, the length of the thinking section is proportional to the problem complexity. This makes DeepSeek v4 Flash usable with thinking enabled in conditions where other models are practically impossible to use.
The model features a context window of 1 million tokens. Being so large, it knows more things if you go sampling at the edge of its knowledge. For instance, asking about Italian shows or political questions soon uncovers that 284B parameters are a lot more than 27B or 35B parameters. It writes much better English and Italian. It feels like a quasi-frontier model. The KV cache is incredibly compressed, allowing long-context inference on local computers and on-disk KV cache persistence. It works well with 2-bit quantization, if quantized in a special way (read later). This allows running it on MacBooks with 128GB of RAM. We expect DeepSeek to release updated versions of v4 Flash in the future, even better than the current one.
That said, a few important things about this project: the local inference landscape contains many excellent projects, but new models are released continuously, and attention immediately gets captured by the next model to implement. This project takes a deliberately narrow bet: one model at a time, official-vector validation (logits obtained with the official implementation), long-context tests, and enough agent integration to know if it really works. The exact model may change as the landscape evolves, but the constraint remains: credible local inference on high-end personal machines or Mac Studios, starting from 128GB of memory.
This software is developed with strong assistance from GPT 5.5, with humans leading the ideas, testing, and debugging. We say this openly because it shaped how the project was built. If you are not happy with AI-developed code, this software is not for you. The acknowledgement below is equally important: this would not exist without llama.cpp and GGML, largely written by hand. This implementation is based on the idea that compressed KV caches like the one of DeepSeek v4, together with the fast SSDs of modern MacBooks, should change the assumption that the KV cache belongs in RAM. The KV cache is actually a first-class disk citizen.
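To make the disk-first KV cache idea concrete, here is a minimal sketch of how a cache can be backed by a file and memory-mapped, so the OS pages it in and out on demand. The file naming, sizing, and layout here are assumptions for illustration, not the actual ds4.c on-disk format:

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map a KV cache file into memory. With MAP_SHARED, pages written here are
 * flushed back to the file, so the KV state of a long prompt can survive a
 * process restart instead of being recomputed. */
void *kv_cache_open(const char *path, size_t bytes) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return NULL;
    if (ftruncate(fd, (off_t)bytes) != 0) { close(fd); return NULL; }
    void *kv = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping keeps the file contents reachable */
    return kv == MAP_FAILED ? NULL : kv;
}

/* Ask the kernel to write dirty KV pages back to disk. */
void kv_cache_sync(void *kv, size_t bytes) {
    msync(kv, bytes, MS_ASYNC);
}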
Our vision is that local inference should be a set of three things working well together, out of the box: A) an inference engine with an HTTP API + B) GGUFs specially crafted to run well under a given engine and given assumptions + C) testing and validation with coding agent implementations. This inference engine only runs with the GGUF files provided. It gets tested against officially obtained logits at different context sizes. This project exists because we wanted to make one local model feel finished end to end, not just runnable. However, this is still alpha-quality code, so we are probably not there yet.
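As a usage sketch for point A, here is a minimal C client talking to the server over HTTP with libcurl. The port, path, and JSON fields are assumptions made for illustration, not the documented ds4.c API:

#include <curl/curl.h>
#include <stdio.h>

int main(void) {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *h = curl_easy_init();
    if (h == NULL) return 1;

    /* Hypothetical endpoint and request shape, for illustration only. */
    const char *body =
        "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}]}";
    struct curl_slist *hdrs =
        curl_slist_append(NULL, "Content-Type: application/json");

    curl_easy_setopt(h, CURLOPT_URL, "http://127.0.0.1:8080/v1/chat/completions");
    curl_easy_setopt(h, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(h, CURLOPT_POSTFIELDS, body);

    CURLcode rc = curl_easy_perform(h); /* the reply is printed to stdout */
    if (rc != CURLE_OK)
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(rc));

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(h);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}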
This is Metal-only. Will we implement CUDA support in the future? Perhaps, but nothing more. The CPU path is only for correctness checks, but be warned: current macOS versions have a bug in the virtual memory implementation that will crash the kernel if you try to run the CPU code. Remember? Software sucks. It was not possible to fix the CPU inference to avoid crashing, since every attempt requires restarting the computer, which is not funny. Help us, if you have the guts.
Acknowledgements to llama.cpp and GGML
ds4.c does not link against GGML, but it exists thanks to the path opened by the llama.cpp project and the kernels, quantization formats, GGUF ecosystem, and hard-won engineering knowledge developed there. We are thankful and indebted to llama.cpp and its contributors. Their implementation, kernels, tests, and design choices were an essential reference while building this DeepSeek V4 Flash-specific inference path. Some source-level pieces are retained or adapted here under the MIT license: GGUF quant layouts and tables, CPU quant/dot logic, and certain Metal kernels. For this reason, and because we are genuinely grateful, we keep the GGML authors' copyright notice in our LICENSE file.
Model Weights
This implementation only works with the DeepSeek V4 Flash GGUFs published for this project. It is not a general GGUF loader, and arbitrary DeepSeek/GGUF files will not have the tensor layout, quantization mix, metadata, or optional MTP state expected by the engine. The 2-bit quantizations provided here are not a joke: they behave well, work under coding agents, and call tools reliably. The 2-bit quants use a very asymmetrical quantization: only the routed MoE experts are quantized, up/gate at IQ2_XXS, down at Q2_K. They account for the majority of the model's space; the other components (shared experts, projections, routing) are left untouched to guarantee quality.
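To make the asymmetric policy concrete, here is a small sketch of the kind of per-tensor check a loader can apply. The tensor name patterns and type names are assumptions for illustration, not the exact identifiers used by ds4.c:

#include <stdbool.h>
#include <string.h>

typedef enum { DS4_F16, DS4_Q8_0, DS4_Q2_K, DS4_IQ2_XXS } ds4_qtype;

/* Only routed-expert tensors may be 2-bit: up/gate at IQ2_XXS, down at Q2_K.
 * Everything else (shared experts, projections, routing) must keep a higher
 * precision type. */
static bool ds4_quant_policy_ok(const char *tensor_name, ds4_qtype type) {
    bool up_gate = strstr(tensor_name, "ffn_up_exps")   != NULL ||
                   strstr(tensor_name, "ffn_gate_exps") != NULL;
    bool down    = strstr(tensor_name, "ffn_down_exps") != NULL;

    if (up_gate) return type == DS4_IQ2_XXS;
    if (down)    return type == DS4_Q2_K;
    return type != DS4_IQ2_XXS && type != DS4_Q2_K;
}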
Download one main model:
./download_model.sh q2 # 128 GB RAM machines
./download_model.sh q4 # >= 256 GB RAM machines
The script downloads from https://huggingface.co/antirez/deepseek-v4-gguf, stores files under ./gguf/, resumes partial downloads with curl -C -, and updates ./ds4flash.gguf to point at the selected q2/q4 model. Authentication is optional for public downloads.