jundot / omlx


oMLX: LLM inference, optimized for your Mac. Continuous batching and tiered KV caching, managed directly from your menu bar.

Every LLM server I tried made me choose between convenience and control. I wanted to pin everyday models in memory, auto-swap heavier ones on demand, set context limits, and manage it all from a menu bar. oMLX persists the KV cache across a hot in-memory tier and a cold SSD tier: even when the context changes mid-conversation, all past context stays cached and reusable across requests, which makes local LLMs practical for real coding work with tools like Claude Code. That's why I built it.

Install macOS App


Download the .dmg from Releases, drag it to Applications, and you're done. The app includes in-app auto-update, so future upgrades are a single click. Note that the macOS app does not install the omlx CLI command; for terminal usage, install via Homebrew or from source.

Homebrew


brew tap jundot/omlx
brew install omlx

# Upgrade to the latest version
brew update && brew upgrade omlx

# Run as a background service (auto-restarts on crash)
brew services start omlx

# Optional: MCP (Model Context Protocol) support
/opt/homebrew/opt/omlx/libexec/bin/pip install mcp

From Source


git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e .          # Core only
pip install -e ".[mcp]"   # With MCP support

Requires macOS 15.0+ (Sequoia), Python 3.10+, and Apple Silicon (M1/M2/M3/M4).

Quickstart


macOS App: Launch oMLX from your Applications folder. The Welcome screen guides you through three steps: choosing a model directory, starting the server, and downloading your first model. That's it. To connect OpenClaw, OpenCode, or Codex, see Integrations.

CLI: Run omlx serve --model-dir ~/models. The server automatically discovers LLMs, VLMs, embedding models, and rerankers from subdirectories. Any OpenAI-compatible client can connect to http://localhost:8000/v1, and a built-in chat UI is available at http://localhost:8000/admin/chat.
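For example, here is a minimal sketch of connecting with the openai Python package. It assumes the server is running on the default port 8000, that no API key is enforced locally, and that "my-model" stands in for whatever model the server discovered under ~/models.

# Minimal sketch: chat completion against the local oMLX server.
# Assumptions: default http://localhost:8000, no API key required,
# "my-model" is a placeholder for a discovered model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Summarize what oMLX does in one sentence."}],
)
print(response.choices[0].message.content)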

Features


  • Supports text LLMs, vision-language models (VLMs), OCR models, embeddings, and rerankers on Apple Silicon.
  • Admin Dashboard: Web UI at /admin for real-time monitoring, model management, chat, benchmarking, and per-model settings. Supports English, Korean, Japanese, Chinese, and Russian. All CDN dependencies are vendored for fully offline operation.
  • Vision-Language Models: Run VLMs with the same continuous batching and tiered KV cache stack as text LLMs. Supports multi-image chat, base64/URL/file image inputs, and tool calling with vision context (see the image-input sketch after this list). OCR models (DeepSeek-OCR, DOTS-OCR, GLM-OCR) are auto-detected with optimized prompts.
  • Tiered KV Cache (Hot + Cold): Block-based KV cache management inspired by vLLM, with prefix sharing and copy-on-write. The cache spans two tiers: a hot tier (RAM) for frequently accessed blocks and a cold tier (SSD) for offloaded blocks. Blocks are restored from disk instead of being recomputed, even after a server restart.
  • Continuous Batching: Handles concurrent requests through mlx-lm's BatchGenerator.
  • Claude Code Optimization: Context scaling support for running smaller-context models with Claude Code.
  • Multi-Model Serving: Load LLMs, VLMs, embedding models, and rerankers within the same server. Includes LRU eviction, manual load/unload, model pinning, and per-model TTL (see the embeddings sketch after this list).
  • Per-Model Settings: Configure sampling parameters, chat template kwargs, TTL, model alias, and more directly from the admin panel.
  • Built-in Chat: Chat directly with any loaded model from the admin panel.
  • Model Downloader: Search and download MLX models from HuggingFace directly in the admin dashboard.
  • Integrations: Set up OpenClaw, OpenCode, Codex, and Pi directly from the admin dashboard with a single click.
  • Performance Benchmark: One-click benchmarking from the admin panel.
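To illustrate the vision-language bullet above, here is a hedged sketch of a multimodal request sent through the same OpenAI-compatible chat endpoint. The model name and image URL are placeholders, and the exact set of supported image input fields may differ from what is shown.

# Sketch of an image + text request via /v1/chat/completions.
# "my-vlm" and the image URL are placeholders, not assets shipped with oMLX.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="my-vlm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)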
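Similarly, for the multi-model serving bullet, a sketch of calling an embedding model, assuming oMLX exposes the standard OpenAI-style embeddings route; the model name is again a placeholder, and reranker usage is not covered here.

# Sketch of an embeddings request, assuming a standard /v1/embeddings route.
# "my-embedding-model" is a placeholder for a discovered embedding model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

result = client.embeddings.create(
    model="my-embedding-model",
    input=["tiered KV cache", "continuous batching"],
)
print(len(result.data[0].embedding))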