Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

Ask HN：有人用本地模型完全替代 Claude/GPT 进行日常编程了吗？

Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g tok/s). 这里有人已经完全用本地模型替代 Claude/GPT 作为主要编程工具，而不仅仅是用于业余实验了吗？如果有，请分享一下你的配置和性能表现（例如每秒生成的 token 数）。

Greenpants: I have! I care about data privacy and LLMs being free. I’m using the Pi coding harness but containerized and sandboxed, to make sure it’s running completely offline. On my Mac Studio with 128GB RAM (or MacBook with 36GB RAM) I’m using Qwen3.6 35b, with only 3b active parameters so that it runs really fast. I’ve done a complete redesign for my website’s homepage and blog with Django + Wagtail. The latter is interesting, because Wagtail is a bit less well-known, so the agent, without giving it internet access, doesn’t always know how to develop for Wagtail. I’ve used Qwen3.5 122b for when things get more complex. At 10b active parameters, it’s significantly slower though.

Greenpants： 我做到了！我非常看重数据隐私，也希望 LLM 能免费使用。我正在使用 Pi 编程框架，但将其容器化并放入沙盒中，以确保它完全离线运行。在配备 128GB 内存的 Mac Studio（或 36GB 内存的 MacBook）上，我运行的是 Qwen3.6 35b 模型，仅激活 3b 参数，因此运行速度非常快。我已经用 Django + Wagtail 对我的网站主页和博客进行了彻底的重新设计。Wagtail 的情况很有趣，因为它相对冷门，所以当我不给代理（agent）联网权限时，它并不总是知道如何开发 Wagtail。在处理更复杂的问题时，我会使用 Qwen3.5 122b，但当激活 10b 参数时，它的速度会明显变慢。

I’ve noticed a few things compared to large models like Claude. For starters, you really need to know what you’re asking, and be precise; it doesn’t do much thinking for you. Any assumptions left open, and it’ll take the easiest route to reach the goal (e.g. CSS in HTML), often not the best in terms of architecture. It gets into loops quite often, and surprisingly often gets the edit tool call wrong, after which it will spend lots of thinking tokens and re-read files instead of retrying (despite the system prompt suggesting so).

与 Claude 等大型模型相比，我注意到了一些差异。首先，你必须非常清楚自己的需求并保持精确；它不会为你进行太多的深度思考。如果你留有任何假设，它会选择最简单的路径来达成目标（例如直接在 HTML 中写 CSS），这在架构上往往不是最优解。它经常陷入循环，而且令人惊讶的是，它经常会错误地调用编辑工具，之后它会消耗大量的思考 token 并重新读取文件，而不是尝试重试（尽管系统提示词建议它这样做）。

Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture. If Opus gives a 15x speedup, local and fully offline Qwen gives a 5x speedup. Which, given that it’s completely free, is still mind-boggling to me :)

将智能体化的 Qwen3.6 35b 与 Claude Opus 相比，就像是一个知识面广但需要你时刻指导的初级工程师，而 Opus 则是一位能与你共同思考架构的资深专家。如果说 Opus 能带来 15 倍的效率提升，那么本地且完全离线的 Qwen 能带来 5 倍的提升。考虑到它是完全免费的，这对我来说依然令人难以置信 :)

lambda: This is very similar to my setup. Pi in a container (I do let it have network access, just no access to creds or anything, only the one directory that I’m working on at the time and my ~/.pi directory), talking to llama.cpp in another container. I’m on a Strix Halo 128 GiB unified memory laptop. I’ve never used the frontier models in earnest, I don’t believe in using proprietary tools for my programming, so I can’t really compare. And I’m still a AI skeptic, so I’m doing more testing and kicking the tires than I am actually using it. That means I spend a lot of time trying to break various models, probe them for strengths and weaknesses, etc.

lambda： 这和我的配置非常相似。Pi 运行在容器中（我确实给了它网络访问权限，但不允许访问凭据等敏感信息，仅限于我当前工作的目录和 ~/.pi 目录），并与另一个容器中的 llama.cpp 通信。我使用的是配备 128 GiB 统一内存的 Strix Halo 笔记本电脑。我从未认真使用过那些前沿模型，我不相信在编程中使用专有工具，所以我无法进行真正的比较。而且我仍然是一个 AI 怀疑论者，所以我更多是在进行测试和“试驾”，而不是真正投入使用。这意味着我花了很多时间试图“搞崩”各种模型，探测它们的优缺点等。

But I find that when I do try to use it for real for agentic coding, Qwen 3.6 35B-A3B is definitely the one I reach for the most often. For other chat tasks and translation, I’ll frequently use Gemma 4 31B. For audio, I’ll use Gemma 4 12B. I keep a bunch of other models around to try out every once in a while… but so far Qwen 3.6 35B-A3B is really the sweet spot for coding on a setup like this.

但我发现，当我真正尝试将其用于智能体编程时，Qwen 3.6 35B-A3B 绝对是我最常使用的模型。对于其他聊天任务和翻译，我经常使用 Gemma 4 31B。对于音频任务，我使用 Gemma 4 12B。我保留了许多其他模型偶尔尝试……但到目前为止，Qwen 3.6 35B-A3B 在这样的配置下确实是编程的最佳平衡点。

chakspak: Hopefully this isn’t off-topic, but your setup sounds just like mine, Strix Halo and (I’m assuming) llama.cpp on ROCm, and I’m finding that the Qwen hybrid models don’t handle prompt caching and instead re-process the context in full on every turn. I’m wondering if you were able to solve this and how?

chakspak： 希望这不算跑题，你的配置听起来和我的完全一样，Strix Halo 加上（我猜是）运行在 ROCm 上的 llama.cpp。我发现 Qwen 混合模型无法处理提示词缓存（prompt caching），而是会在每一轮对话中重新处理全部上下文。我想知道你是否解决了这个问题，以及是如何解决的？

lambda: I use Vulkan mostly instead of ROCm. Vulkan is actually a bit faster, paradoxically. I do switch out and try them both out, and it’s not a huge difference, but I’ve been mostly staying on Vulkan. The re-processing context every turn problem is definitely something I’ve hit. Some of the causes have been solved upstream in llama.cpp; make sure you’re up to date. But another cause of the issue that has a big effect is that older Qwen models didn’t support preserving thinking. This means that each time you have a long sequence of tool calls with interleaved thinking, as soon as you had your next turn in the chat, it would have to re-process all of that as it would drop all of the reasoning.

lambda： 我主要使用 Vulkan 而不是 ROCm。矛盾的是，Vulkan 实际上稍微快一点。我确实会切换并尝试两者，虽然差异不大，但我大部分时间都停留在 Vulkan 上。每一轮重新处理上下文的问题我确实遇到过。其中一些原因已经在 llama.cpp 的上游版本中得到了解决；请确保你的版本是最新的。但导致该问题的另一个重要原因是，旧的 Qwen 模型不支持“保留思考过程”（preserving thinking）。这意味着每当你有一长串带有交替思考的工具调用序列时，一旦进入下一轮对话，它就必须重新处理所有内容，因为它会丢弃之前的推理过程。

Qwen 3.6, however, now supports preserving thinking. This can use a bit more context, because you’re not dropping the thinking every turn, but it re-uses the cache better, not causing you to have to reprocess a whole turn at a time each time. In my models.ini, I have this for the Qwen3.6 models: chat-template-kwargs = {"preserve_thinking": true}. There are still occasional issues I hit where it will have to re-process, but getting up to date and enabling preserve_thinking has helped a ton.

然而，Qwen 3.6 现在支持保留思考过程。这可能会占用更多的上下文空间，因为你不会在每一轮都丢弃思考过程，但它能更好地复用缓存，从而避免了每次都必须重新处理整个对话轮次。在我的 models.ini 中，针对 Qwen3.6 模型，我设置了 chat-template-kwargs = {"preserve_thinking": true}。虽然我偶尔仍会遇到需要重新处理的情况，但更新版本并启用 preserve_thinking 已经带来了巨大的改善。