Build a Unified AI Gateway with LiteLLM and Ollama

Put all your AI models, local and cloud, behind a single OpenAI-compatible API with LiteLLM and Ollama. LiteLLM is a proxy server that exposes 100+ LLM providers through one endpoint. Connect it to Ollama for local inference and you get load balancing, cost tracking, rate limits, and automatic fallback routing.

What You Need

  • Python 3.9+
  • Ollama installed and running
  • About 20 minutes

Setup

  1. Install LiteLLM

pip install 'litellm[proxy]'

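  2. Create config.yaml

model_list:
  - model_name: qwen3-local              # name clients pass as "model"
    litellm_params:
      model: ollama/qwen3:14b            # provider prefix, then the Ollama model tag
      api_base: http://localhost:11434   # Ollama's default address
      rpm: 30                            # requests per minute for this deployment
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY # read from the OPENAI_API_KEY environment variable

general_settings:
  master_key: sk-your-key                # replace with your own; clients send it as their API key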
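Before starting the proxy, it can help to confirm that Ollama is reachable and has already pulled the model named in the config. The following is a minimal sketch, assuming Ollama listens on the api_base shown above and that its /api/tags endpoint lists local models under a "models" key; the helper names model_names and ollama_has_model are hypothetical and not part of LiteLLM or Ollama.

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434"  # assumed: same api_base as in config.yaml


def model_names(tags_payload: dict) -> set[str]:
    """Extract model names from the JSON body of an Ollama /api/tags response."""
    return {m["name"] for m in tags_payload.get("models", [])}


def ollama_has_model(name: str, base_url: str = OLLAMA_BASE) -> bool:
    """Return True if the Ollama server at base_url lists `name` locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return name in model_names(json.load(resp))


# Usage (needs a running Ollama server):
#   ollama_has_model("qwen3:14b")
# "qwen3:14b" is the part after "ollama/" in the config's model field.
```

If the check returns False, pulling the model with Ollama first should avoid errors on the first request through the proxy.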

  3. Start the Proxy

litellm --config config.yaml --port 4000

  4. Use It

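from openai import OpenAI
# Point the OpenAI client at the LiteLLM proxy; the key is the proxy's master_key.
client = OpenAI(api_key="sk-your-key", base_url="http://localhost:4000/v1")
# "qwen3-local" is the model_name from config.yaml, not the Ollama tag.
response = client.chat.completions.create(
    model="qwen3-local",
    messages=[{"role": "user", "content": "Hello!"}])
print(response.choices[0].message.content)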
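Because the proxy speaks the OpenAI wire format, the same call can be made without the SDK. The sketch below uses only the standard library and assumes the proxy from the previous step is listening on port 4000 with the master key from config.yaml sent as a bearer token; build_chat_request, extract_reply, and ask are hypothetical helper names chosen for illustration.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:4000/v1/chat/completions"
PROXY_KEY = "sk-your-key"  # assumed: the master_key from config.yaml


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion POST aimed at the proxy."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        PROXY_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {PROXY_KEY}",
        },
        method="POST",
    )


def extract_reply(completion: dict) -> str:
    """Pull the assistant text out of a chat completion response body."""
    return completion["choices"][0]["message"]["content"]


def ask(model: str, prompt: str) -> str:
    """Send the prompt through the proxy and return the reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_reply(json.load(resp))
```

Switching between the local and cloud model is then only a change of the model argument, for example ask("qwen3-local", "Hello!") versus ask("gpt-4o-mini", "Hello!").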

Key Features

  • Smart fallback - if the local model fails, requests are routed to a cloud model automatically
  • Load balancing - distribute requests across multiple GPU instances
  • Cost tracking - per-model spend dashboard
  • Rate limiting - control requests per user or key
  • One API - works with any tool that supports the OpenAI format
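The sample config in this guide lists two models but does not declare a fallback rule, so the idea behind the first bullet is illustrated here on the client side instead. This is a sketch of the general technique, not LiteLLM's own routing: complete_with_fallback is a hypothetical helper that tries model names in order, and send stands for any function that takes (model, prompt) and returns the reply text.

```python
from typing import Callable, Optional, Sequence


def complete_with_fallback(
    send: Callable[[str, str], str],
    models: Sequence[str],
    prompt: str,
) -> tuple[str, str]:
    """Try each model in order; return (model_used, reply) from the first success."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return model, send(model, prompt)
        except Exception as exc:  # broad on purpose: any failure moves on to the next model
            last_error = exc
    # Every model failed (or the list was empty): surface the last error as the cause.
    raise RuntimeError("all models failed") from last_error
```

With the client from the usage step, send could be a small wrapper around client.chat.completions.create, and models could be ["qwen3-local", "gpt-4o-mini"] so that the cloud model is only used when the local one raises an error.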

Cost vs Cloud

Gateway            Free, self-hosted
Local inference    Free
Model switching    One endpoint
Failover           Automatic

Full guide with advanced config examples: https://everylocalai.com/stack/litellm-ollama-gateway