Running Gemma 4 on a Modest Machine: Unsloth vs LM Studio vs llama.cpp vs Ollama

This is a submission for the Gemma 4 Challenge: Write About Gemma 4.

When local AI conversations happen online, they tend to sound like this: “I ran the 70B model on my dual-GPU workstation.” or “You only need 64GB RAM and a 24GB graphics card.”

Meanwhile, I’m sitting with an Intel i5, 16GB RAM, integrated graphics, roughly 350GB of storage, and no monster GPU hiding under my desk.

That made me curious. If I wanted to build something with Gemma 4 locally, which stack actually makes sense on hardware that most developers realistically own?

So I looked at four names that keep coming up: Unsloth, LM Studio, llama.cpp, and Ollama. At first they looked like competing products. After spending time with them, I realised they solve different parts of the same problem.

The first lesson: these tools aren’t really competitors

My initial assumption was simple. Pick one, ignore the others. But they fit together more like a pipeline:

Model fine-tuning → Unsloth
Inference engine → llama.cpp
Serving layer → Ollama
Desktop UI → LM Studio

Rather than replacing each other, they stack. In fact, LM Studio and Ollama both use llama.cpp under the hood. You don’t necessarily need to install llama.cpp separately unless you want direct, low-level control over quantization or server flags.

Unsloth: fine-tuning without the anxiety

Fine-tuning usually sounds expensive. Huge GPUs, large memory requirements, long training runs. Unsloth tries to cut that cost significantly.

Would I train a large Gemma variant on my setup? Probably not. But smaller experiments and LoRA fine-tuning on the E2B or E4B models feel a lot less out of reach. The interesting thing about Unsloth isn’t just the speed gains. It’s that it makes the whole process feel less like something only research labs do.

That said, Unsloth is built primarily for GPU training, and on a CPU-only machine even small fine-tuning jobs are slow at best. For anything beyond a quick experiment, I’d probably train in a free Google Colab session with a T4 GPU, then export the resulting GGUF to run locally.
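
To see why LoRA makes fine-tuning feel within reach, it helps to count trainable weights. The sketch below is illustrative arithmetic only. The layer shape (2048 x 2048) and the rank of 16 are hypothetical values, not Gemma 4's actual dimensions, and the code does not use Unsloth's API.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights for a LoRA adapter of the given rank on the same matrix.

    LoRA freezes the original matrix and learns two small factors,
    B (d_out x rank) and A (rank x d_in), whose product is the update.
    """
    return rank * (d_out + d_in)


if __name__ == "__main__":
    # Hypothetical layer shape and rank, chosen only for illustration.
    d_out, d_in, rank = 2048, 2048, 16
    full = full_params(d_out, d_in)
    lora = lora_params(d_out, d_in, rank)
    print(f"full fine-tune: {full:,} trainable weights")
    print(f"LoRA (rank {rank}): {lora:,} trainable weights")
    print(f"LoRA trains {lora / full:.2%} of the layer")
```

Under these assumed numbers the adapter trains a small fraction of the layer's weights, which is where most of the memory saving during training comes from.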

LM Studio: the least intimidating place to start

LM Studio removes almost all the friction. Download it, pick a model, run it, start testing. For a machine like mine, that matters.

The tradeoffs are real though. Larger models hit hardware limits quickly, and you have less control than you’d get with lower-level tools. But if someone asked me where to start if they’ve never run a local model before, LM Studio would be my first recommendation.

llama.cpp: the engine quietly powering everything

llama.cpp isn’t flashy. No polished interface, no big buttons. But it shows up everywhere, and for good reason.

The smallest Gemma 4 model needs roughly 4GB of RAM at Q4 quantization, and the largest can push to around 20GB. On a 16GB machine, that is the difference between comfortable headroom and a model that does not fit at all. Quantized models running through llama.cpp are often what makes local AI possible on hardware that would otherwise be too constrained. Without that kind of optimization, things get difficult fast.
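
A rough way to reason about these numbers is to multiply the parameter count by the bits stored per weight. The sketch below is a back-of-the-envelope estimate under stated assumptions: the parameter counts are hypothetical round numbers rather than Gemma 4's real sizes, and 4.5 bits per weight is an assumed average for a Q4-style format (4-bit values plus per-block scaling data). It covers the weights only, so real usage is higher once the context cache and runtime buffers are added.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical parameter counts, for illustration only.
    examples = [("4B-parameter model", 4e9), ("30B-parameter model", 30e9)]
    for label, n_params in examples:
        fp16 = weight_memory_gb(n_params, 16)   # unquantized half precision
        q4 = weight_memory_gb(n_params, 4.5)    # assumed ~4.5 bits per weight
        print(f"{label}: fp16 ~ {fp16:.1f} GB, Q4 ~ {q4:.1f} GB")
```

Even with these made-up sizes the pattern is clear: quantization cuts the weight footprint to a bit more than a quarter of the half-precision size, which is what brings small models within reach of a 16GB machine.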

Ollama: local AI that feels like infrastructure

Ollama was the tool that clicked immediately.

    ollama run gemma4:e4b

That simplicity changes your relationship with the whole thing. Instead of spending time managing files and configs, you spend time building. When you’re working with FastAPI, Django, LangChain, or agent systems, Ollama starts feeling less like software and more like infrastructure you just trust to be there.
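
From application code, the same model is reachable over Ollama's local HTTP API. The following is a minimal sketch using only the Python standard library. It assumes an Ollama server is running on its default port (11434) and that the gemma4:e4b tag from the command above has already been pulled. The helper names build_request and generate are my own, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "gemma4:e4b") -> urllib.request.Request:
    """Build a non-streaming generate request (hypothetical helper name)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "gemma4:e4b", timeout: float = 300.0) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With "stream": False the server replies with one JSON object
    # whose "response" field holds the full completion.
    return body["response"]


# Example call (requires the server to be running):
#   print(generate("Explain quantization in one sentence."))
```

Keeping the request builder separate from the network call makes the payload easy to inspect or unit-test without a server, and the generous timeout is a guess at what CPU-only inference might need.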

What I’d actually run on my machine

Gemma 4 comes in four sizes: E2B, E4B, the 26B MoE model, and the 31B dense model. Given my hardware, the 26B and 31B variants are effectively off the table unless I want to tolerate heavy disk offloading and painful slowdowns. The E2B and E4B models are specifically designed for edge and on-device deployment, which makes them the realistic options here. Quantized versions where possible.

My stack would look like this:

Experimentation: LM Studio
Application serving: Ollama
Optimized inference: llama.cpp (when I need direct control)
Fine-tuning experiments: Unsloth

The RAM reality check

Can you install all four on a 16GB machine? Yes. Can you run them all simultaneously while hosting a model? No. Each tool loads its own copy of the model weights into RAM, and that memory is not shared between them. You can’t have LM Studio and Ollama both holding a 6GB model in memory at the same time and still leave headroom for your OS and browser.

The practical workflow is switching between them: experiment in LM Studio, shut it down, then serve via Ollama when you’re building.
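
The arithmetic behind that limit is simple enough to sketch. The 16GB total and the 6GB model come from the text above. The 5GB reserved for the operating system and a browser is my own assumed figure and will vary from machine to machine.

```python
def free_after_loading(total_gb: float, reserved_gb: float, model_sizes_gb: list) -> float:
    """Memory left (in GB) after the OS/browser reserve and every loaded model copy."""
    return total_gb - reserved_gb - sum(model_sizes_gb)


if __name__ == "__main__":
    TOTAL_GB = 16.0
    RESERVED_GB = 5.0  # assumed OS + browser footprint, not a measured value
    MODEL_GB = 6.0     # size of one loaded model copy, from the example above

    one_tool = free_after_loading(TOTAL_GB, RESERVED_GB, [MODEL_GB])
    two_tools = free_after_loading(TOTAL_GB, RESERVED_GB, [MODEL_GB, MODEL_GB])

    print(f"one tool holding the model:  {one_tool:+.1f} GB free")
    print(f"two tools holding the model: {two_tools:+.1f} GB free")
```

A negative result means the machine would have to swap to disk, which is exactly the slowdown that switching between tools avoids.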

What I actually took away from this

The most useful discovery wasn’t which tool is best. It was realising that local AI is becoming less about raw hardware and more about the tooling around it.

I am building EdgeTutor for kids in rural classrooms in South Africa. It is an application that helps teachers support each child with knowledge tailored to their needs. Models like Gemma 4 make this possible because they run on modest computing resources.

A few years ago, a machine like mine wouldn’t really be part of the conversation. The smaller Gemma 4 models are specifically designed for efficient local execution on laptops and mobile devices, which means developers who aren’t sitting on workstation hardware can genuinely participate now.

Maybe not with the biggest models. But enough to build. And sometimes that is all you need.
