Build a Unified AI Gateway with LiteLLM and Ollama
Unify all your AI models, local and cloud, behind a single OpenAI-compatible API with LiteLLM and Ollama. LiteLLM is a proxy server that exposes 100+ LLM providers through one endpoint. Connect it to Ollama for local inference, and you get load balancing, cost tracking, rate limiting, and automatic fallback routing.
What You Need
- Python 3.9+
- Ollama installed and running
- About 20 minutes
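
Before installing anything else, it can help to confirm that Ollama is actually reachable. The sketch below is one way to do that, assuming Ollama listens on its default address `http://localhost:11434` and serves its model list at `/api/tags`. The helper name `list_ollama_models` is made up for this example.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional

# Assumed default Ollama address; adjust if your install differs.
OLLAMA_URL = "http://localhost:11434"


def list_ollama_models(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> Optional[List[str]]:
    """Return the names of locally pulled models, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply.
        return None
    return [m.get("name", "") for m in payload.get("models", [])]


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama is not reachable; start it before continuing.")
    else:
        print("Ollama is running. Local models:", models)
```

If the model you plan to serve is missing from the list, pull it with Ollama first.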
Setup
- Install LiteLLM

      pip install 'litellm[proxy]'

- Create config.yaml

      model_list:
        - model_name: qwen3-local
          litellm_params:
            model: ollama/qwen3:14b
            api_base: http://localhost:11434
            rpm: 30
        - model_name: gpt-4o-mini
          litellm_params:
            model: openai/gpt-4o-mini
            api_key: os.environ/OPENAI_API_KEY
      general_settings:
        master_key: sk-your-key

- Start the Proxy

      litellm --config config.yaml --port 4000

- Use It

      from openai import OpenAI

      # Point the standard OpenAI client at the LiteLLM proxy.
      client = OpenAI(api_key="sk-your-key", base_url="http://localhost:4000/v1")
      response = client.chat.completions.create(
          model="qwen3-local",
          messages=[{"role": "user", "content": "Hello!"}],
      )
      print(response.choices[0].message.content)

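
Once the proxy is up, a quick sanity check is to ask it which models it serves. The sketch below assumes the proxy from the steps above is listening on port 4000 with the master key `sk-your-key`, and that it answers the OpenAI-style `/v1/models` route. The helper names `parse_model_ids` and `fetch_model_ids` are invented for this example.

```python
import json
import urllib.error
import urllib.request
from typing import List

PROXY_URL = "http://localhost:4000"  # port passed to `litellm --port`
MASTER_KEY = "sk-your-key"           # master_key from config.yaml


def parse_model_ids(body: dict) -> List[str]:
    """Extract model ids from an OpenAI-style model list response."""
    return [item["id"] for item in body.get("data", [])]


def fetch_model_ids(base_url: str = PROXY_URL, api_key: str = MASTER_KEY,
                    timeout: float = 5.0) -> List[str]:
    """Ask the proxy which model names it currently serves."""
    req = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))


if __name__ == "__main__":
    try:
        print("Proxy serves:", fetch_model_ids())
    except (urllib.error.URLError, OSError) as exc:
        print("Proxy not reachable:", exc)
```

With the config shown earlier, the list would be expected to include `qwen3-local` and `gpt-4o-mini`.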
Key Features
- Smart fallback - if the local model fails, requests are rerouted to a cloud model automatically (once a fallback is configured)
- Load balancing - distribute requests across multiple GPU instances
- Cost tracking - per-model spend dashboard
- Rate limiting - cap requests per user or API key
- One API - use any tool that supports OpenAI format
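
To make the fallback idea concrete, here is a minimal client-side sketch of the routing logic: try the local model first, then move on to the cloud model if the call raises. LiteLLM does this inside the proxy, driven by configuration this post does not show, so treat the code below as an illustration of the concept rather than LiteLLM's implementation. The `send` callable and the function name `complete_with_fallback` are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def complete_with_fallback(
    send: Callable[[str, str], str],
    prompt: str,
    models: Sequence[str],
) -> Tuple[str, str]:
    """Try each model in order; return (model_used, reply) from the first success."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return model, send(model, prompt)
        except Exception as exc:  # any failure moves on to the next model
            last_error = exc
    raise RuntimeError("all models failed") from last_error


# Demo with a stand-in `send` that simulates the local model being down.
def fake_send(model: str, prompt: str) -> str:
    if model == "qwen3-local":
        raise ConnectionError("local model unavailable")
    return f"{model} answered: {prompt}"


used, reply = complete_with_fallback(fake_send, "Hello!", ["qwen3-local", "gpt-4o-mini"])
print(used, "->", reply)
```

In real use, `send` would wrap a chat completion call against the proxy. Handling fallback in the gateway instead keeps this retry logic out of every client.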
Cost vs Cloud
| Item | LiteLLM + Ollama |
|---|---|
| Gateway | Free, self-hosted |
| Local inference | No per-token fees (runs on your own hardware) |
| Model switching | One endpoint |
| Failover | Automatic |
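
The "one endpoint" row can be illustrated without any SDK. In the sketch below, switching between the local and the cloud model changes only the `model` field of the request body, while the URL and key stay the same. It assumes the proxy address and master key used earlier in this post, and `build_chat_request` is a name invented for the example.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str,
                       base_url: str = "http://localhost:4000/v1",
                       api_key: str = "sk-your-key") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


local_req = build_chat_request("qwen3-local", "Hello!")
cloud_req = build_chat_request("gpt-4o-mini", "Hello!")

# Same endpoint for both; only the model name in the body differs.
print(local_req.full_url == cloud_req.full_url)
print(json.loads(local_req.data)["model"], json.loads(cloud_req.data)["model"])
```

Passing either request to `urllib.request.urlopen` would send it to the running proxy.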
Full guide with advanced config examples: https://everylocalai.com/stack/litellm-ollama-gateway