Build a Unified AI Gateway with LiteLLM and Ollama

Put all your AI models, local and cloud, behind a single OpenAI-compatible API with LiteLLM and Ollama. LiteLLM is a proxy server that exposes 100+ LLM providers through one endpoint. Connect it to Ollama for local inference, and you get load balancing, cost tracking, rate limits, and automatic fallback routing.

What You Need

Python 3.9+
Ollama installed and running
About 20 minutes
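
Before installing anything, it can help to confirm that Ollama is actually reachable and that the model used later in this guide has been pulled. The following is a minimal sketch, assuming Ollama listens on its default address and using its /api/tags endpoint to list local models. The helper names (pulled_models, fetch_tags, check_ollama) are made up for this example.

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434"  # assumption: Ollama's default address


def pulled_models(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by Ollama's /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def fetch_tags(base_url: str = OLLAMA_BASE, timeout: float = 5.0) -> dict:
    """Ask the local Ollama server which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def check_ollama(model: str = "qwen3:14b") -> bool:
    """Return True if Ollama answers and the given model is available locally."""
    try:
        return model in pulled_models(fetch_tags())
    except OSError:  # connection refused, timeout, and similar network errors
        return False
```

Calling check_ollama() returns False both when the server is down and when the model is missing, so a False result means either start Ollama or pull the model.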

Setup

Install LiteLLM

pip install 'litellm[proxy]'

Create config.yaml

model_list:
  - model_name: qwen3-local
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://localhost:11434
      rpm: 30
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

general_settings:
  master_key: sk-your-key

Start the Proxy

litellm --config config.yaml --port 4000

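Once the proxy is up, a quick way to confirm that it loaded config.yaml is to list the models it exposes. This is a rough sketch that assumes the proxy serves the OpenAI-style /v1/models route on port 4000 and accepts the placeholder master key from the config. The function names are illustrative only.

```python
import json
import urllib.request

PROXY_BASE = "http://localhost:4000"  # port passed to the litellm command
MASTER_KEY = "sk-your-key"            # placeholder master_key from config.yaml


def model_ids(models_payload: dict) -> list[str]:
    """Pull the model ids out of an OpenAI-style model list response."""
    return [m["id"] for m in models_payload.get("data", [])]


def list_proxy_models(base_url: str = PROXY_BASE,
                      api_key: str = MASTER_KEY,
                      timeout: float = 5.0) -> list[str]:
    """Query the proxy for the models it currently exposes."""
    req = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

If everything is wired up, list_proxy_models() should include the model_name values from config.yaml, such as qwen3-local and gpt-4o-mini.
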
Use It

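from openai import OpenAI

# Point the standard OpenAI client at the LiteLLM proxy; the key is master_key from config.yaml.
client = OpenAI(api_key="sk-your-key", base_url="http://localhost:4000/v1")
# "qwen3-local" is a model_name from config.yaml; the proxy routes it to Ollama.
response = client.chat.completions.create(
    model="qwen3-local",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)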
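
Because the proxy speaks the OpenAI chat completions format, any HTTP client can call it, not only the OpenAI SDK. The sketch below builds the equivalent request with the standard library, to show roughly what goes over the wire. It assumes the same base URL and placeholder key as above, and build_chat_request and send_chat are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completions request without sending it."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # master_key or a key issued by the proxy
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat(req: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

Switching from the local model to the cloud one is then only a change of the model argument, for example "gpt-4o-mini" instead of "qwen3-local".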

Key Features

Smart fallback - if the local model fails, requests are routed to a cloud model automatically
Load balancing - distribute requests across multiple GPU instances
Cost tracking - per-model spend dashboard
Rate limiting - control requests per user or API key
One API - works with any tool that supports the OpenAI format
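
To make the fallback idea concrete, here is a small client-side illustration of the behavior: try the local model first and move to the cloud model only if the call raises. This is not how LiteLLM implements it. The proxy handles fallback on the server from its own configuration (see the LiteLLM docs for the exact settings), and complete_with_fallback and call_model are hypothetical names used only for this sketch.

```python
from typing import Callable, Sequence


def complete_with_fallback(call_model: Callable[[str], str],
                           models: Sequence[str]) -> tuple[str, str]:
    """Try each model in order; return (model_used, reply) from the first success."""
    last_error = None
    for model in models:
        try:
            return model, call_model(model)
        except Exception as exc:  # any failure moves on to the next model
            last_error = exc
    raise RuntimeError(f"all models failed: {list(models)}") from last_error
```

In practice call_model would wrap a chat completion call against the proxy, and the list would be ordered from cheapest to most expensive, for example ["qwen3-local", "gpt-4o-mini"].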

Cost vs Cloud

Gateway	Free, self-hosted
Local inference	Free of per-token charges (runs on your own hardware)
Model switching	One endpoint
Failover	Automatic

Full guide with advanced config examples: https://everylocalai.com/stack/litellm-ollama-gateway