jundot / omlx


oMLX: LLM inference, optimized for your Mac. Continuous batching and tiered KV caching, managed directly from your menu bar.

Every LLM server I tried made me choose between convenience and control. I wanted to pin everyday models in memory, auto-swap heavier ones on demand, set context limits, and manage it all from a menu bar. oMLX persists the KV cache across a hot in-memory tier and a cold SSD tier: even when the context changes mid-conversation, all past context stays cached and reusable across requests, which makes local LLMs practical for real coding work with tools like Claude Code. That's why I built it.

Install macOS App


Download the .dmg from Releases, drag it to Applications, and you're done. The app includes in-app auto-update, so future upgrades are a single click. Note that the macOS app does not install the omlx CLI command; for terminal usage, install via Homebrew or from source.

Homebrew


brew tap jundot/omlx
brew install omlx

# Upgrade to the latest version
brew update && brew upgrade omlx

# Run as a background service (auto-restarts on crash)
brew services start omlx

# Optional: MCP (Model Context Protocol) support
/opt/homebrew/opt/omlx/libexec/bin/pip install mcp

From Source


git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e .          # Core only
pip install -e ".[mcp]"   # With MCP support

Requires macOS 15.0+ (Sequoia), Python 3.10+, and Apple Silicon (M1/M2/M3/M4).

Quickstart


macOS App: Launch oMLX from your Applications folder. The Welcome screen guides you through three steps: choosing a model directory, starting the server, and downloading your first model. That's it. To connect OpenClaw, OpenCode, or Codex, see Integrations.

CLI: Run omlx serve --model-dir ~/models. The server automatically discovers LLMs, VLMs, embedding models, and rerankers from subdirectories. Any OpenAI-compatible client can connect to http://localhost:8000/v1, and a built-in chat UI is available at http://localhost:8000/admin/chat.
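For example, here is a minimal sketch of connecting with the openai Python package. It assumes the server is running on the default port 8000, that no API key is enforced locally, and that "my-model" stands in for whatever model the server discovered under ~/models.

# Minimal sketch: chat completion against the local oMLX server.
# Assumptions: default http://localhost:8000, no API key required,
# "my-model" is a placeholder for a discovered model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Summarize what oMLX does in one sentence."}],
)
print(response.choices[0].message.content)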

Features


  • Supports text LLMs, vision-language models (VLMs), OCR models, embeddings, and rerankers on Apple Silicon.
  • Admin Dashboard: Web UI at /admin for real-time monitoring, model management, chat, benchmarking, and per-model settings. Supports English, Korean, Japanese, Chinese, and Russian. All CDN dependencies are vendored for fully offline operation.
  • Vision-Language Models: Run VLMs with the same continuous batching and tiered KV cache stack as text LLMs. Supports multi-image chat, base64/URL/file image inputs, and tool calling with vision context (see the image-input sketch after this list). OCR models (DeepSeek-OCR, DOTS-OCR, GLM-OCR) are auto-detected with optimized prompts.
  • Tiered KV Cache (Hot + Cold): Block-based KV cache management inspired by vLLM, with prefix sharing and copy-on-write. The cache spans two tiers: a hot tier (RAM) for frequently accessed blocks and a cold tier (SSD) for offloaded blocks. Blocks are restored from disk instead of being recomputed, even after a server restart.
  • Continuous Batching: Handles concurrent requests through mlx-lm's BatchGenerator.
  • Claude Code Optimization: Context scaling support for running smaller-context models with Claude Code.
  • Multi-Model Serving: Load LLMs, VLMs, embedding models, and rerankers within the same server. Includes LRU eviction, manual load/unload, model pinning, and per-model TTL (see the embeddings sketch after this list).
  • Per-Model Settings: Configure sampling parameters, chat template kwargs, TTL, model alias, and more directly from the admin panel.
  • Built-in Chat: Chat directly with any loaded model from the admin panel.
  • Model Downloader: Search and download MLX models from HuggingFace directly in the admin dashboard.
  • Integrations: Set up OpenClaw, OpenCode, Codex, and Pi directly from the admin dashboard with a single click.
  • Performance Benchmark: One-click benchmarking from the admin panel.
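To illustrate the vision-language bullet above, here is a hedged sketch of a multimodal request sent through the same OpenAI-compatible chat endpoint. The model name and image URL are placeholders, and the exact set of supported image input fields may differ from what is shown.

# Sketch of an image + text request via /v1/chat/completions.
# "my-vlm" and the image URL are placeholders, not assets shipped with oMLX.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="my-vlm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)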
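Similarly, for the multi-model serving bullet, a sketch of calling an embedding model, assuming oMLX exposes the standard OpenAI-style embeddings route; the model name is again a placeholder, and reranker usage is not covered here.

# Sketch of an embeddings request, assuming a standard /v1/embeddings route.
# "my-embedding-model" is a placeholder for a discovered embedding model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

result = client.embeddings.create(
    model="my-embedding-model",
    input=["tiered KV cache", "continuous batching"],
)
print(len(result.data[0].embedding))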