Run a vLLM Server on HF Jobs in One Command

You can spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command — no servers to provision, no Kubernetes, pay-per-second. Once it’s up, you can query it from your laptop, a notebook, or anywhere else. It’s the quickest way to stand up a model for tests, evals, or batch generation. (If you’re after a managed, production-ready service instead, that’s what Inference Endpoints are for — more on when to pick which at the end.) Here’s the whole thing end to end.

Prerequisites

A payment method or a positive prepaid credit balance (Jobs is billed per-minute by hardware usage).
huggingface_hub >= 1.20.0: pip install -U "huggingface_hub>=1.20.0".
Logged in locally: hf auth login.

Launch the server

hf jobs run is essentially docker run for HF infrastructure. We use the official vllm/vllm-openai image, pick a GPU with --flavor, and expose vLLM’s port with --expose:

hf jobs run --flavor a10g-large --expose 8000 --timeout 2h \
  vllm/vllm-openai:latest \
  vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000

--expose 8000 forwards the container port through HF’s public jobs proxy (see the Serve Models guide for the detailed reference). The command prints the server’s URL:

✓ Job started id: 6a381ca1953ed90bfb947332
url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332

Hint: Exposed ports are reachable at (requires an HF token with read access to the job): https://6a381ca1953ed90bfb947332--8000.hf.jobs

6a381ca1953ed90bfb947332 is your job ID. Keep track of it, because we’ll need it later; the rest of this post uses <job_id> as a placeholder for it. Give the job a couple of minutes to download the weights and boot. When the logs show Application startup complete, you’re live.
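
The URL pattern in the hint above (job ID, two hyphens, port, then .hf.jobs) can be captured in a small helper so the job ID only has to be pasted once. This is a minimal sketch: the function names job_endpoint and openai_base_url are made up for illustration, and the pattern is inferred from the single example URL printed above rather than from a documented specification.

```python
def job_endpoint(job_id: str, port: int = 8000) -> str:
    """Build the proxy URL for a port exposed with --expose.

    Assumption: the URL follows the pattern shown in the CLI hint,
    https://<job_id>--<port>.hf.jobs
    """
    job_id = job_id.strip()
    if not job_id:
        raise ValueError("job_id must not be empty")
    return f"https://{job_id}--{port}.hf.jobs"


def openai_base_url(job_id: str, port: int = 8000) -> str:
    """Base URL to hand to an OpenAI-compatible client (note the /v1 suffix)."""
    return job_endpoint(job_id, port) + "/v1"


print(job_endpoint("6a381ca1953ed90bfb947332"))
# https://6a381ca1953ed90bfb947332--8000.hf.jobs
```

The second helper returns the value used as base_url when pointing an OpenAI client at the server.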

Query it from anywhere

vLLM speaks the OpenAI API, and each request only needs your HF token passed as a Bearer token. The fastest way is curl:

curl https://<job_id>--8000.hf.jobs/v1/chat/completions \
  -H "Authorization: Bearer $(hf auth token)" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'

This returns standard OpenAI-format JSON, with choices[0].message.content containing "Hello! How can I assist you today? 😊". Or, in Python, point the OpenAI client at the exposed URL and pass your token as the API key:

from huggingface_hub import get_token
from openai import OpenAI

client = OpenAI(
    base_url="https://<job_id>--8000.hf.jobs/v1",
    api_key=get_token(),
)

resp = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
# Hello! How can I assist you today? 😊
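
For batch generation, the same call can be fanned out over a list of prompts. The sketch below is one possible approach, not part of the original recipe: generate_batch and make_complete are hypothetical helpers, and the worker count of 8 is an arbitrary starting point rather than a tuned value. It assumes a client object like the one created above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def generate_batch(
    complete: Callable[[str], str],
    prompts: Iterable[str],
    max_workers: int = 8,
) -> list[str]:
    """Run complete() over all prompts concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the inputs.
        return list(pool.map(complete, prompts))


def make_complete(client, model: str = "Qwen/Qwen3-4B") -> Callable[[str], str]:
    """Wrap an OpenAI-style client into a prompt -> text function."""

    def complete(prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            extra_body={"chat_template_kwargs": {"enable_thinking": False}},
        )
        return resp.choices[0].message.content

    return complete


# Usage with the client defined above (needs a running server):
# answers = generate_batch(make_complete(client), ["Hello!", "What is vLLM?"])
```

Because vLLM batches concurrent requests on the server side, sending several requests at once is usually a reasonable way to keep the GPU busy; the right level of concurrency depends on the model and hardware.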

Quick health check before you start: curl https://<job_id>--8000.hf.jobs/v1/models -H "Authorization: Bearer $(hf auth token)" should list the model.
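
If you would rather not watch the logs, the same /v1/models check can be polled until the server answers. The following is a rough sketch using only the standard library; probe_models and wait_until_ready are hypothetical names, and the 10-minute budget and 10-second interval are guesses rather than recommended values.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_models(base_url: str, token: str, timeout: float = 10.0) -> bool:
    """Return True if GET <base_url>/v1/models answers with HTTP 200."""
    request = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers URLError/HTTPError and timeouts: treat all as "not ready yet".
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    max_wait: float = 600.0,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or max_wait seconds have passed."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Usage (needs a running job and a token, e.g. from huggingface_hub.get_token()):
# url = "https://<job_id>--8000.hf.jobs"
# ready = wait_until_ready(lambda: probe_models(url, token))
```

A False result only means the endpoint did not answer in time; the job logs remain the authoritative place to see why.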

🔐 The endpoint is gated, not public. Every request must carry an HF token with read access to the job’s namespace. A plain browser visit will be rejected. In effect, the jobs proxy is your API gate: access is scoped to you (and your org). That’s fine for private use, but treat the URL accordingly: don’t share it expecting it to be open, and don’t paste your token into untrusted places. If you need finer-grained or public access, put a proper gateway in front instead. Or see “HF Jobs or Inference Endpoints?” below.

Clean up

Jobs are billed per second, so stop the server when you’re done: hf jobs cancel <job_id>

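The --timeout you set is a safety net (the job stops automatically when it expires), but cancelling explicitly saves money. a10g-large costs $1.50 per hour; check hf jobs hardware for the full price list, and pick the smallest flavor that fits your model.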
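
To get a feel for what a session costs, the arithmetic is simply the hourly rate times the runtime. This is a back-of-the-envelope sketch: estimate_cost is a hypothetical helper, the $1.50 default is the a10g-large rate quoted above (which may change), and it assumes billing is linear in runtime with no minimum charge or rounding.

```python
def estimate_cost(runtime_seconds: float, hourly_rate_usd: float = 1.50) -> float:
    """Estimate the cost in USD of running a job for runtime_seconds.

    Assumption: cost scales linearly with runtime (no minimums, no rounding
    to a billing granularity).
    """
    if runtime_seconds < 0:
        raise ValueError("runtime_seconds must be non-negative")
    return round(hourly_rate_usd * runtime_seconds / 3600, 4)


print(estimate_cost(20 * 60))   # 20 minutes of a10g-large -> 0.5
print(estimate_cost(2 * 3600))  # the full 2h timeout      -> 3.0
```

Under these assumptions, a job cancelled after 20 minutes costs about a sixth of one left to run into its 2h timeout.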

Going further: bigger models

The same command scales to larger models: pick a bigger --flavor and use --tensor-parallel-size to tell vLLM to shard the model across multiple GPUs. For example, to run the 122B Qwen3.5 MoE model on 2× H200:

hf jobs run --flavor h200x2 --expose 8000 --timeout 2h \
  vllm/vllm-openai:latest \
  vllm serve Qwen/Qwen3.5-122B-A10B \
  --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \
  --max-model-len 32768 --max-num-seqs 256

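--tensor-parallel-size should match the number of GPUs in the flavor (h200x2 → 2, h200x8 → 8). Run hf jobs hardware to see the available flavors, and set a longer --timeout for larger models, since they take longer to download and load. For large models, the H200 flavors usually offer the best price-performance.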
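
If you script your launches, the GPU count can be derived from the flavor name so the two never drift apart. This sketch assumes multi-GPU flavors end in x followed by the GPU count (as in h200x2 and h200x8) and that any flavor without that suffix, such as a10g-large, has a single GPU; check hf jobs hardware before relying on it. The helper names are made up for illustration.

```python
import re


def gpus_in_flavor(flavor: str) -> int:
    """Guess the GPU count from a flavor name such as 'h200x2'.

    Assumption: a trailing 'x<N>' encodes the GPU count, and flavors
    without that suffix have one GPU.
    """
    match = re.search(r"x(\d+)$", flavor)
    return int(match.group(1)) if match else 1


def vllm_serve_command(model: str, flavor: str, port: int = 8000) -> list[str]:
    """Build the container command with a matching --tensor-parallel-size."""
    return [
        "vllm", "serve", model,
        "--host", "0.0.0.0",
        "--port", str(port),
        "--tensor-parallel-size", str(gpus_in_flavor(flavor)),
    ]


print(" ".join(vllm_serve_command("Qwen/Qwen3.5-122B-A10B", "h200x2")))
# vllm serve Qwen/Qwen3.5-122B-A10B --host 0.0.0.0 --port 8000 --tensor-parallel-size 2
```

The resulting list is what goes after the image name in hf jobs run; model-specific flags such as --max-model-len still need to be added by hand.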

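The --max-model-len 32768 --max-num-seqs 256 flags are specific to this model: Qwen3.5-122B uses a hybrid Mamba/attention architecture with a default context length of 256K, which makes vLLM’s default batching settings run out of memory. Capping the context length and the number of concurrent sequences keeps the server within GPU memory. If a model fails to start with an out-of-memory or cache-block error, lowering these two values is the first thing to try. Everything else (the exposed URL, the OpenAI client, token authentication) stays exactly the same.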