Andyyyy64 / whichllm
whichllm helps you find the best local LLM that actually runs on your hardware. It auto-detects your GPU, CPU, and RAM, then ranks the top HuggingFace models that fit your system.
Quick start
Run the recommendation command once, with no project setup.
uvx whichllm@latest
Simulate a GPU before you buy hardware.
uvx whichllm@latest --gpu "RTX 4090"
Install it if you use it often:
uv tool install whichllm
uv tool upgrade whichllm # update an existing install
Other install paths:
brew install andyyyy64/whichllm/whichllm
pip install whichllm
Common workflows
After installing, run whichllm directly. For one-off runs, replace whichllm with uvx whichllm@latest.
# Best models for this machine
whichllm
# Pretend you have a specific GPU
whichllm --gpu "RTX 4090"
# Compare upgrade candidates
whichllm upgrade "RTX 4090" "RTX 5090" "H100"
# Find the GPU needed for a model
whichllm plan "llama 3 70b"
# Start a chat with a model
whichllm run "qwen 2.5 1.5b gguf"
# Print a copy-paste Python snippet
whichllm snippet "qwen 7b"
# Return JSON for scripts
whichllm --top 1 --json
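For scripting, the JSON output can also be consumed from Python. The sketch below is a minimal example that assumes whichllm is on PATH. The helper names run_json and top_pick are hypothetical, and because the JSON schema is not described here, the result is returned as parsed without assuming any field names.

```python
import json
import subprocess
from typing import Any


def run_json(command: list[str]) -> Any:
    """Run a command and parse its standard output as JSON.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def top_pick() -> Any:
    """Return the parsed output of `whichllm --top 1 --json`.

    Assumption: whichllm is installed and on PATH. The structure of the
    returned object is not assumed; inspect it before relying on fields.
    """
    return run_json(["whichllm", "--top", "1", "--json"])
```

Calling print(json.dumps(top_pick(), indent=2)) is a quick way to inspect the structure before writing code that depends on particular fields.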
See it
$ whichllm --gpu "RTX 4090"
#1 Qwen/Qwen3.6-27B 27.8B Q5_K_M score 92.8 27 t/s
#2 Qwen/Qwen3-32B 32.0B Q4_K_M score 83.0 31 t/s
#3 Qwen/Qwen3-30B-A3B 30.0B Q5_K_M score 82.7 102 t/s
The 32B model fits your card comfortably, yet whichllm still ranks the 27B model #1 because it scores higher on real benchmarks and belongs to a newer generation. A size-only “what fits?” tool would hand you the bigger one. That gap is the whole point of whichllm. (Note #3: it is a MoE model running at 102 t/s, because speed is ranked on active parameters and quality on total parameters.)
What can I run?
Real top picks (snapshot 2026-05; your results track live HuggingFace data, so this is not a static list):
| Hardware | VRAM | Top pick | Speed |
|---|---|---|---|
| RTX 5090 | 32 GB | Qwen3.6-27B · Q6_K · score 94.7 | ~40 t/s |
| RTX 4090 / 3090 | 24 GB | Qwen3.6-27B · Q5_K_M · score 92.8 | ~27 t/s |
| RTX 4060 | 8 GB | Qwen3-14B · Q3_K_M · score 71.0 | ~22 t/s |
| Apple M3 Max | 36 GB | Qwen3.6-27B · Q5_K_M · score 89.4 | ~9 t/s |
| CPU only | — | gpt-oss-20b (MoE) · Q4_K_M · score 45.2 | ~6 t/s |
whichllm --gpu "<your card>" simulates any of these before you buy.
Why whichllm?
Fitting a model into your VRAM is the easy part. The hard part is knowing which of the models that fit is actually the best, and that is what whichllm is built to get right.
- Evidence-based ranking, not a size heuristic: The top pick is chosen from merged real benchmarks (LiveBench, Artificial Analysis, Aider, multimodal/vision, Chatbot Arena ELO, Open LLM Leaderboard), never “the biggest model that happens to fit.”
- Recency-aware: Stale leaderboards are demoted along each model’s lineage, so a 2024 model can’t outrank a current-generation one on an outdated score. The benchmark snapshot date is printed under every ranking, so a stale recommendation is self-evident instead of silently trusted.
- Evidence-graded and guarded: Every score is tagged direct / variant / base / interpolated / self-reported and discounted by confidence. Fabricated uploader claims and cross-family inheritance (a small fork borrowing its much larger base’s score) are actively rejected.
- Architecture-aware estimates: VRAM = weights + GQA KV cache + activation + overhead. Speed is modeled as bandwidth-bound, with per-quant efficiency, per-backend factors, a MoE active-vs-total split, and unified-memory vs discrete-PCIe partial-offload modeling.
- One command, scriptable: whichllm prints the answer; add --json | jq for pipelines. No TUI, no keybindings to memorize.
- Live data: Models are fetched directly from the HuggingFace API, with curated frozen fallbacks for offline or rate-limited use.
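The VRAM and speed formulas in the architecture-aware bullet can be sketched roughly as follows. This is an illustrative approximation, not whichllm's actual code: ModelSpec, the function names, the default activation and overhead sizes, and the 0.6 efficiency factor are all assumptions. Per-backend factors and partial offload are left out.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    """Hypothetical model description; field names are illustrative."""
    total_params_b: float   # total parameters, in billions
    active_params_b: float  # parameters read per token (equals total for dense models)
    n_layers: int
    n_kv_heads: int         # GQA: usually fewer KV heads than attention heads
    head_dim: int


def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(spec: ModelSpec, n_ctx: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: K and V tensors per layer, sized by the KV heads only."""
    elems = 2 * spec.n_layers * spec.n_kv_heads * spec.head_dim * n_ctx
    return elems * bytes_per_elem / 1e9


def estimate_vram_gb(spec: ModelSpec, bits_per_weight: float, n_ctx: int,
                     activation_gb: float = 0.5, overhead_gb: float = 1.0) -> float:
    """VRAM = weights + GQA KV cache + activation + overhead (assumed constants)."""
    return (weights_gb(spec.total_params_b, bits_per_weight)
            + kv_cache_gb(spec, n_ctx) + activation_gb + overhead_gb)


def estimate_tokens_per_s(spec: ModelSpec, bits_per_weight: float,
                          bandwidth_gb_s: float, efficiency: float = 0.6) -> float:
    """Bandwidth-bound decode speed: each token reads the active weights once."""
    gb_read_per_token = weights_gb(spec.active_params_b, bits_per_weight)
    return efficiency * bandwidth_gb_s / gb_read_per_token


# Hypothetical dense and MoE models with the same total size.
dense = ModelSpec(30.0, 30.0, n_layers=48, n_kv_heads=8, head_dim=128)
moe = ModelSpec(30.0, 3.0, n_layers=48, n_kv_heads=8, head_dim=128)
```

Under these assumptions both models need the same VRAM, because the memory estimate uses total parameters, while the MoE model is estimated to decode faster, because the speed estimate uses only active parameters. That is consistent with the MoE entry in the sample output above.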
Features
- Auto-detect hardware: NVIDIA, AMD, Apple Silicon, CPU-only
- Smart ranking: Scores models by VRAM fit, speed, and benchmark quality
- One-command chat: whichllm run downloads and starts a chat session instantly
- Code snippets: whichllm snippet prints ready-to-run Python for any model
- Live data: Fetches models directly from HuggingFace (cached for performance)
- Benchmark-aware: Integrates real eval scores with confidence-based dampening
- Task profiles: Filter by general, coding, vision, or math use cases
- GPU simulation: Test with any GPU: whichllm --gpu "RTX 4090"
- Hardware planning: Reverse lookup: whichllm plan "llama 3 70b"
- Upgrade planning: Compare your current machine with candidate GPUs
- JSON output: Pipe-friendly: whichllm --json
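Confidence-based dampening can take several forms. The sketch below shows one possible form, shrinking a score toward a neutral prior as evidence gets weaker. The evidence tags are the ones listed in this README, but the numeric confidences, the prior of 50, and the function name are invented for illustration and are not whichllm's actual values.

```python
# Hypothetical confidence per evidence tag. The tags come from the README;
# the numbers are assumptions chosen only to illustrate the idea.
EVIDENCE_CONFIDENCE = {
    "direct": 1.0,         # score measured on this exact model
    "variant": 0.9,        # score from a close variant
    "base": 0.75,          # score inherited from the base model
    "interpolated": 0.5,   # score estimated from related models
    "self-reported": 0.3,  # score claimed by the uploader
}


def dampened_score(raw_score: float, evidence: str, prior: float = 50.0) -> float:
    """Shrink a benchmark score toward a neutral prior as confidence drops."""
    confidence = EVIDENCE_CONFIDENCE[evidence]
    return prior + confidence * (raw_score - prior)
```

With these assumed numbers, a directly measured score is left unchanged, while a self-reported score of 90 is pulled most of the way back toward the prior, so an unverified claim cannot outrank a measured result of similar size.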
Run & Snippet
Try any model with a single command. No manual installs are needed: whichllm creates an isolated environment via uv, installs dependencies, downloads the model, and starts an interactive chat.
# Chat with a model (auto-picks the best GGUF variant)
whichllm run "qwen 2.5 1.5b gguf"
# Auto-pick the best model for your hardware and chat
whichllm run
# CPU-only mode
whichllm run "phi 3 mini gguf" --cpu-only
Works with all model formats:
- GGUF — via llama-cpp-python (lightweight, fast)
- AWQ / GPTQ — via transformers + autoawq / auto-gptq
- FP16 / BF16 — via transformers
Get a copy-paste Python snippet instead:
whichllm snippet "qwen 7b"
from llama_cpp import Llama
llm = Llama.from_pretrained(
repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",
filename="qwen2.5-7b-instruct-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=-1,
verbose=False,
)
output = llm.create_chat_completion(
messages=[{"role": "user", "content": "Hello!"}],
)
print(output["choices"][0]["message"]["content"])
Usage
# Auto-detect hardware and show best models
whichllm
# Simulate a GPU (e.g. planning a purchase)
whichllm --gpu "RTX 4090"
whichllm --gpu "RTX 5090"
# Specify variant
whichllm --gpu "RTX 5060 16"
# CPU-only mode
whichllm --cpu-only
# More results / filters
whichllm --top 20
whichllm --quant Q4_K_M
whichllm --min-speed 30
whichllm --evidence base # allow id/base-model matches
whichllm --evidence strict # id-exact only (same as --direct)
whichllm --direct
# JSON output
whichllm --json
# Force refresh (ignore cache)
whichllm --refresh
# Show hardware info only
whichllm hardware
# Plan: what GPU do I need for a specific model?
whichllm plan "llama 3 70b"
whichllm plan "Qwen2.5-72B" --quant Q