Andyyyy64 / whichllm

whichllm helps you find the best local LLM that actually runs on your hardware. It auto-detects your GPU, CPU, and RAM, then ranks the top HuggingFace models that fit your system.

Quick start

Run the recommendation command once, with no project setup:

uvx whichllm@latest

Simulate a GPU before you buy hardware:

uvx whichllm@latest --gpu "RTX 4090"

Install it when you use it often:

uv tool install whichllm
uv tool upgrade whichllm # update an existing install

Other install paths:

brew install andyyyy64/whichllm/whichllm
pip install whichllm

Common workflows

After install, run whichllm directly. For one-off runs, replace whichllm with uvx whichllm@latest.

# Best models for this machine
whichllm

# Pretend you have a specific GPU
whichllm --gpu "RTX 4090"

# Compare upgrade candidates
whichllm upgrade "RTX 4090" "RTX 5090" "H100"

# Find the GPU needed for a model
whichllm plan "llama 3 70b"

# Start a chat with a model
whichllm run "qwen 2.5 1.5b gguf"

# Print a copy-paste Python snippet
whichllm snippet "qwen 7b"

# Return JSON for scripts
whichllm --top 1 --json
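
For scripting from Python rather than the shell, the JSON mode can be consumed with the standard library. The following is a minimal sketch: the helper names are hypothetical, and because this README does not document the shape of the JSON payload, the code returns the parsed value as-is instead of assuming any field names.

```python
import json
import subprocess


def build_command(top: int = 1) -> list[str]:
    """Build the argv for a JSON run, using the --top and --json flags shown above."""
    return ["whichllm", "--top", str(top), "--json"]


def parse_output(stdout: str):
    """Parse the JSON text printed by whichllm; no schema is assumed here."""
    return json.loads(stdout)


def top_models(top: int = 1):
    """Run whichllm and return the parsed JSON (requires whichllm on PATH)."""
    result = subprocess.run(
        build_command(top), capture_output=True, text=True, check=True
    )
    return parse_output(result.stdout)
```

Calling top_models(1) on a machine with whichllm installed should return whatever structure the tool emits; inspect it once before relying on specific keys.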

See it

$ whichllm --gpu "RTX 4090"
#1 Qwen/Qwen3.6-27B 27.8B Q5_K_M score 92.8 27 t/s
#2 Qwen/Qwen3-32B 32.0B Q4_K_M score 83.0 31 t/s
#3 Qwen/Qwen3-30B-A3B 30.0B Q5_K_M score 82.7 102 t/s

The 32B model fits your card fine, but whichllm still ranks the 27B model #1 because it scores higher on real benchmarks and is a newer generation. A size-only "what fits?" tool would hand you the bigger one. That gap is the whole point of whichllm. (Note on #3: it is a MoE model running at 102 t/s; speed is ranked on active parameters, quality on total parameters.)
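
The MoE note can be illustrated with a rough bandwidth-bound estimate: each generated token has to read the active weights once, so tokens per second is roughly usable memory bandwidth divided by the bytes of active weights. This is a back-of-the-envelope sketch and not whichllm's actual model. The function name, the efficiency factor, and the sample inputs (bandwidth, bits per weight, and roughly 3B active parameters, as the A3B suffix suggests) are illustrative assumptions.

```python
def estimate_tokens_per_second(
    active_params_b: float,
    bits_per_weight: float,
    bandwidth_gb_s: float,
    efficiency: float = 0.6,
) -> float:
    """Rough decode speed: usable bandwidth / GB of weights read per token."""
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s * efficiency / gb_read_per_token


# Illustrative inputs only: a ~1000 GB/s card and ~5.5 bits per weight.
dense = estimate_tokens_per_second(27.8, 5.5, 1000.0)  # all 27.8B params active
moe = estimate_tokens_per_second(3.0, 5.5, 1000.0)     # ~3B active of 30B total
print(f"dense: {dense:.0f} t/s, MoE: {moe:.0f} t/s")
```

The only point here is the direction of the effect: fewer active parameters means fewer bytes read per token, so a MoE model can be much faster than a dense model of similar total size. The absolute numbers from this naive ratio should not be expected to match whichllm's output, which also models per-quant and per-backend factors.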

What can I run?

Real top picks (snapshot 2026-05). Your results track live HuggingFace data; this is not a static list:

Hardware	VRAM	Top pick	Speed
RTX 5090	32 GB	Qwen3.6-27B · Q6_K · score 94.7	~40 t/s
RTX 4090 / 3090	24 GB	Qwen3.6-27B · Q5_K_M · score 92.8	~27 t/s
RTX 4060	8 GB	Qwen3-14B · Q3_K_M · score 71.0	~22 t/s
Apple M3 Max	36 GB	Qwen3.6-27B · Q5_K_M · score 89.4	~9 t/s
CPU only	—	gpt-oss-20b (MoE) · Q4_K_M · score 45.2	~6 t/s

whichllm --gpu "<your card>" simulates any of these before you buy.

Why whichllm?

Fitting a model into your VRAM is the easy part. The hard part is knowing which of the models that fit is actually the best, and that is what whichllm is built to get right.

Evidence-based ranking, not a size heuristic: The top pick is chosen from merged real benchmarks (LiveBench, Artificial Analysis, Aider, multimodal/vision, Chatbot Arena ELO, Open LLM Leaderboard), never "the biggest model that happens to fit."
Recency-aware: Stale leaderboards are demoted along each model's lineage, so a 2024 model can't outrank a current-generation one on an outdated score. The benchmark snapshot date is printed under every ranking, so a stale recommendation is self-evident instead of silently trusted.
Evidence-graded and guarded: Every score is tagged direct / variant / base / interpolated / self-reported and discounted by confidence. Fabricated uploader claims and cross-family inheritance (a small fork borrowing its much larger base's score) are actively rejected.
Architecture-aware estimates: VRAM = weights + GQA KV cache + activation + overhead; speed is bandwidth-bound with per-quant efficiency, per-backend factors, MoE active-vs-total split, and unified-memory vs discrete-PCIe partial-offload modeling.
One command, scriptable: whichllm prints the answer; add --json | jq for pipelines. No TUI, no keybindings to memorize.
Live data: Models are fetched directly from the HuggingFace API, with curated frozen fallbacks for offline or rate-limited use.
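
The VRAM formula in the list above (weights + GQA KV cache + activation + overhead) can be sketched as follows. This is a simplified illustration under stated assumptions, not whichllm's implementation: the function names are hypothetical, the activation and overhead terms are fixed placeholder constants, and the example model shape (32 layers, 8 KV heads, head dimension 128) is illustrative only.

```python
def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_ctx: int,
    bytes_per_elem: int = 2,
) -> float:
    """GQA KV cache size: K and V each hold n_kv_heads * head_dim values
    per layer per token, stored at bytes_per_elem (2 for FP16)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1e9


def estimate_vram_gb(
    params_b: float,
    bits_per_weight: float,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_ctx: int,
    activation_gb: float = 0.5,  # placeholder assumption
    overhead_gb: float = 1.0,    # placeholder assumption
) -> float:
    """VRAM = weights + KV cache + activation + overhead, all in GB."""
    weights_gb = params_b * bits_per_weight / 8
    cache_gb = kv_cache_gb(n_layers, n_kv_heads, head_dim, n_ctx)
    return weights_gb + cache_gb + activation_gb + overhead_gb


# Hypothetical 8B model at 4 bits per weight with a 4096-token context.
print(f"{estimate_vram_gb(8.0, 4.0, 32, 8, 128, 4096):.2f} GB")
```

With grouped-query attention the cache scales with the number of KV heads rather than the number of attention heads, which is why the KV term stays small relative to the weights at short context lengths.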

Features

Auto-detect hardware: NVIDIA, AMD, Apple Silicon, CPU-only
Smart ranking: Scores models by VRAM fit, speed, and benchmark quality
One-command chat: whichllm run downloads a model and starts a chat session
Code snippets: whichllm snippet prints ready-to-run Python for any model
Live data: Fetches models directly from HuggingFace (cached for performance)
Benchmark-aware: Integrates real eval scores with confidence-based dampening
Task profiles: Filter by general, coding, vision, or math use cases
GPU simulation: Test with any GPU: whichllm --gpu "RTX 4090"
Hardware planning: Reverse lookup: whichllm plan "llama 3 70b"
Upgrade planning: Compare your current machine with candidate GPUs
JSON output: Pipe-friendly: whichllm --json
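
To make "confidence-based dampening" concrete, here is one simple way such a discount could work: multiply each raw benchmark score by a weight tied to its evidence tag, then rank on the result. This is only a sketch of the general idea. The weights, function names, and sample models below are invented for illustration and are not whichllm's actual values or algorithm.

```python
# Hypothetical confidence weights per evidence tag; NOT whichllm's real values.
CONFIDENCE = {
    "direct": 1.0,
    "variant": 0.9,
    "base": 0.8,
    "interpolated": 0.6,
    "self-reported": 0.4,
}


def dampened_score(raw_score: float, evidence: str) -> float:
    """Discount a raw benchmark score by the confidence of its evidence tag."""
    return raw_score * CONFIDENCE[evidence]


def rank(models: list[dict]) -> list[dict]:
    """Sort models by dampened score, best first."""
    return sorted(
        models,
        key=lambda m: dampened_score(m["score"], m["evidence"]),
        reverse=True,
    )


# Made-up candidates: a modest directly measured score vs. a high self-reported one.
candidates = [
    {"name": "model-a", "score": 80.0, "evidence": "direct"},
    {"name": "model-b", "score": 95.0, "evidence": "self-reported"},
]
print([m["name"] for m in rank(candidates)])  # ['model-a', 'model-b']
```

Under these assumed weights, the directly measured 80 outranks the self-reported 95, which is the behavior the evidence tags are meant to produce.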

Run & Snippet

Try any model with a single command. No manual installs are needed: whichllm creates an isolated environment via uv, installs dependencies, downloads the model, and starts an interactive chat.

# Chat with a model (auto-picks the best GGUF variant)
whichllm run "qwen 2.5 1.5b gguf"

# Auto-pick the best model for your hardware and chat
whichllm run

# CPU-only mode
whichllm run "phi 3 mini gguf" --cpu-only

Works with all model formats:

GGUF — via llama-cpp-python (lightweight, fast)
AWQ / GPTQ — via transformers + autoawq / auto-gptq
FP16 / BF16 — via transformers

Get a copy-paste Python snippet instead:

whichllm snippet "qwen 7b"

from llama_cpp import Llama
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",
    filename="qwen2.5-7b-instruct-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    verbose=False,
)
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
)
print(output["choices"][0]["message"]["content"])

Usage

# Auto-detect hardware and show best models
whichllm

# Simulate a GPU (e.g. planning a purchase)
whichllm --gpu "RTX 4090"
whichllm --gpu "RTX 5090"

# Specify variant
whichllm --gpu "RTX 5060 16"

# CPU-only mode
whichllm --cpu-only

# More results / filters
whichllm --top 20
whichllm --quant Q4_K_M
whichllm --min-speed 30
whichllm --evidence base # allow id/base-model matches
whichllm --evidence strict # id-exact only (same as --direct)
whichllm --direct

# JSON output
whichllm --json

# Force refresh (ignore cache)
whichllm --refresh

# Show hardware info only
whichllm hardware

# Plan: what GPU do I need for a specific model?
whichllm plan "llama 3 70b"
whichllm plan "Qwen2.5-72B" --quant Q