Andyyyy64 / whichllm

whichllm helps you find the best local LLM that actually runs on your hardware. It auto-detects your GPU, CPU, and RAM, then ranks the top models from HuggingFace that fit your system.

Quick start

Run the recommendation command once, with no project setup:

uvx whichllm@latest

Simulate a GPU before you buy hardware:

uvx whichllm@latest --gpu "RTX 4090"

Install it if you use it often:

uv tool install whichllm
uv tool upgrade whichllm # update an existing install

Other install paths:

brew install andyyyy64/whichllm/whichllm
pip install whichllm

Common workflows

After install, run whichllm directly. For one-off runs, replace whichllm with uvx whichllm@latest.

# Best models for this machine
whichllm

# Pretend you have a specific GPU
whichllm --gpu "RTX 4090"

# Compare upgrade candidates
whichllm upgrade "RTX 4090" "RTX 5090" "H100"

# Find the GPU needed for a model
whichllm plan "llama 3 70b"

# Start a chat with a model
whichllm run "qwen 2.5 1.5b gguf"

# Print a copy-paste Python snippet
whichllm snippet "qwen 7b"

# Return JSON for scripts
whichllm --top 1 --json
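
For scripting from Python, the JSON mode can be wrapped in a small helper. The sketch below is illustrative only: it uses the --gpu, --top, and --json flags shown above (combining them in one call is an assumption), and it assumes nothing about the JSON schema beyond the output being valid JSON. The helper names build_command and recommend are hypothetical and are not part of whichllm.

```python
import json
import subprocess


def build_command(top=1, gpu=None, base=("whichllm",)):
    """Build the argument list for a JSON-producing whichllm call.

    `base` is the launcher; pass ("uvx", "whichllm@latest") for one-off runs.
    """
    cmd = list(base)
    if gpu is not None:
        # Simulate a specific GPU instead of auto-detecting hardware.
        cmd += ["--gpu", gpu]
    cmd += ["--top", str(top), "--json"]
    return cmd


def recommend(top=1, gpu=None, base=("whichllm",)):
    """Run whichllm and return its parsed JSON output.

    The structure of the returned object is not assumed here; inspect it
    before relying on specific keys.
    """
    result = subprocess.run(
        build_command(top=top, gpu=gpu, base=base),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout)
```

Calling recommend(top=1) needs whichllm on the PATH; recommend(top=1, base=("uvx", "whichllm@latest")) uses the one-off form instead.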

See it

$ whichllm --gpu "RTX 4090"
#1 Qwen/Qwen3.6-27B 27.8B Q5_K_M score 92.8 27 t/s
#2 Qwen/Qwen3-32B 32.0B Q4_K_M score 83.0 31 t/s
#3 Qwen/Qwen3-30B-A3B 30.0B Q5_K_M score 82.7 102 t/s

The 32B model fits your card fine, but whichllm still ranks the 27B #1 because it scores higher on real benchmarks and is a newer generation. A size-only “what fits?” tool would hand you the bigger one. That gap is the whole point of whichllm. (Note on #3: it is a MoE model running at 102 t/s; speed is ranked on active parameters, quality on total parameters.)
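
The difference between the two selection rules can be shown with the three rows from the sample output above. This is a toy contrast, not whichllm's ranking code: the real ranking combines more signals, and the Candidate class and both pick_* functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    params_b: float      # total parameters, in billions
    score: float         # score column from the sample output
    tokens_per_s: float  # estimated speed from the sample output


# Values copied from the sample output; all three fit the simulated card.
CANDIDATES = [
    Candidate("Qwen/Qwen3.6-27B", 27.8, 92.8, 27),
    Candidate("Qwen/Qwen3-32B", 32.0, 83.0, 31),
    Candidate("Qwen/Qwen3-30B-A3B", 30.0, 82.7, 102),
]


def pick_by_size(candidates):
    """Size-only heuristic: the largest model that fits wins."""
    return max(candidates, key=lambda c: c.params_b)


def pick_by_score(candidates):
    """Evidence-based rule: the highest score among models that fit wins."""
    return max(candidates, key=lambda c: c.score)
```

With this data, pick_by_size returns Qwen/Qwen3-32B while pick_by_score returns Qwen/Qwen3.6-27B, which matches the #1 row in the output.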

What can I run?

Real top picks (snapshot 2026-05; your results track live HuggingFace data, so this is not a static list):

| Hardware | VRAM | Top pick | Speed |
|---|---|---|---|
| RTX 5090 | 32 GB | Qwen3.6-27B · Q6_K · score 94.7 | ~40 t/s |
| RTX 4090 / 3090 | 24 GB | Qwen3.6-27B · Q5_K_M · score 92.8 | ~27 t/s |
| RTX 4060 | 8 GB | Qwen3-14B · Q3_K_M · score 71.0 | ~22 t/s |
| Apple M3 Max | 36 GB | Qwen3.6-27B · Q5_K_M · score 89.4 | ~9 t/s |
| CPU only | - | gpt-oss-20b (MoE) · Q4_K_M · score 45.2 | ~6 t/s |

whichllm --gpu "<your card>" simulates any of these before you buy.

Why whichllm?

Fitting a model into your VRAM is the easy part. The hard part is knowing which of the models that fit is actually the best, and that is what whichllm is built to get right.

  • Evidence-based ranking, not a size heuristic: The top pick is chosen from merged real benchmarks (LiveBench, Artificial Analysis, Aider, multimodal/vision, Chatbot Arena ELO, Open LLM Leaderboard), never “the biggest model that happens to fit.”
  • Recency-aware: Stale leaderboards are demoted along each model’s lineage, so a 2024 model can’t outrank a current-generation one on an outdated score. The benchmark snapshot date is printed under every ranking, so a stale recommendation is self-evident instead of silently trusted.
  • Evidence-graded and guarded: Every score is tagged direct / variant / base / interpolated / self-reported and discounted by confidence. Fabricated uploader claims and cross-family inheritance (a small fork borrowing its much larger base’s score) are actively rejected.
  • Architecture-aware estimates: VRAM = weights + GQA KV cache + activation + overhead. Speed is modeled as bandwidth-bound, with per-quant efficiency, per-backend factors, a MoE active-vs-total parameter split, and unified-memory vs discrete-PCIe partial-offload modeling.
  • One command, scriptable: whichllm prints the answer; add --json | jq for pipelines. No TUI, no keybindings to memorize.
  • Live data: Models are fetched directly from the HuggingFace API, with curated frozen fallbacks for offline or rate-limited use.
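
The "VRAM = weights + GQA KV cache + activation + overhead" estimate and the bandwidth-bound speed model can be sketched roughly as follows. This is a back-of-the-envelope illustration, not whichllm's implementation: the function names are hypothetical, and the default activation, overhead, and efficiency values are placeholder assumptions rather than the tool's calibrated per-quant and per-backend factors.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size under grouped-query attention (GQA).

    Only the KV heads are cached, not every attention head. The factor 2
    covers keys plus values; bytes_per_elem=2 assumes an FP16 cache.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1e9


def weights_gb(params_b, bits_per_weight):
    """Weight size: billions of parameters times bits, divided by 8, is GB."""
    return params_b * bits_per_weight / 8


def estimate_vram_gb(total_params_b, bits_per_weight, kv_gb,
                     activation_gb=0.5, overhead_gb=1.0):
    """VRAM = weights + KV cache + activation + overhead (placeholder defaults)."""
    return weights_gb(total_params_b, bits_per_weight) + kv_gb + activation_gb + overhead_gb


def estimate_tokens_per_s(active_params_b, bits_per_weight, bandwidth_gb_s,
                          efficiency=0.6):
    """Bandwidth-bound decode speed.

    Each generated token reads the active weights once, so a MoE model is
    costed on its active parameters while VRAM is costed on the total.
    """
    gb_per_token = weights_gb(active_params_b, bits_per_weight)
    return efficiency * bandwidth_gb_s / gb_per_token
```

Under this model, a MoE with 30B total and roughly 3B active parameters needs about as much memory as a dense 30B model but decodes far faster, which is the pattern the ranking output shows for Qwen3-30B-A3B.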

Features

  • Auto-detect hardware: NVIDIA, AMD, Apple Silicon, CPU-only
  • Smart ranking: Scores models by VRAM fit, speed, and benchmark quality
  • One-command chat: whichllm run downloads the model and starts a chat session
  • Code snippets: whichllm snippet prints ready-to-run Python for any model
  • Live data: Fetches models directly from HuggingFace (cached for performance)
  • Benchmark-aware: Integrates real eval scores with confidence-based dampening
  • Task profiles: Filter by general, coding, vision, or math use cases
  • GPU simulation: Test with any GPU: whichllm --gpu "RTX 4090"
  • Hardware planning: Reverse lookup: whichllm plan "llama 3 70b"
  • Upgrade planning: Compare your current machine with candidate GPUs
  • JSON output: Pipe-friendly: whichllm --json
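
One plausible reading of "confidence-based dampening" is to shrink a benchmark score toward a neutral prior as the evidence gets weaker. The sketch below illustrates that idea only. The evidence tags come from this README, but the weights, the prior of 50, and the function name dampened_score are assumptions made up for the example and are not whichllm's actual values or API.

```python
# Hypothetical confidence weight per evidence tag (illustrative values only).
CONFIDENCE = {
    "direct": 1.0,         # benchmark measured on this exact model
    "variant": 0.9,        # measured on a close variant
    "base": 0.8,           # inherited from the base model
    "interpolated": 0.6,   # estimated from neighbouring models
    "self-reported": 0.4,  # claimed by the uploader
}


def dampened_score(raw_score, evidence, prior=50.0):
    """Blend the raw score with a neutral prior according to confidence.

    Full confidence keeps the raw score; lower confidence pulls the result
    toward `prior`, so weak evidence cannot carry a model to the top.
    """
    weight = CONFIDENCE[evidence]
    return weight * raw_score + (1.0 - weight) * prior
```

With these placeholder weights, a raw score of 90 stays at 90 when tagged direct and drops to about 66 when tagged self-reported.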

Run & Snippet

Try any model with a single command. No manual installs are needed: whichllm creates an isolated environment via uv, installs dependencies, downloads the model, and starts an interactive chat.

# Chat with a model (auto-picks the best GGUF variant)
whichllm run "qwen 2.5 1.5b gguf"

# Auto-pick the best model for your hardware and chat
whichllm run

# CPU-only mode
whichllm run "phi 3 mini gguf" --cpu-only

Works with all model formats:

  • GGUF — via llama-cpp-python (lightweight, fast)
  • AWQ / GPTQ — via transformers + autoawq / auto-gptq
  • FP16 / BF16 — via transformers

Get a copy-paste Python snippet instead:

whichllm snippet "qwen 7b"

from llama_cpp import Llama
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",
    filename="qwen2.5-7b-instruct-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    verbose=False,
)
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
)
print(output["choices"][0]["message"]["content"])

Usage

# Auto-detect hardware and show best models
whichllm

# Simulate a GPU (e.g. planning a purchase)
whichllm --gpu "RTX 4090"
whichllm --gpu "RTX 5090"

# Specify variant
whichllm --gpu "RTX 5060 16"

# CPU-only mode
whichllm --cpu-only

# More results / filters
whichllm --top 20
whichllm --quant Q4_K_M
whichllm --min-speed 30
whichllm --evidence base # allow id/base-model matches
whichllm --evidence strict # id-exact only (same as --direct)
whichllm --direct

# JSON output
whichllm --json

# Force refresh (ignore cache)
whichllm --refresh

# Show hardware info only
whichllm hardware

# Plan: what GPU do I need for a specific model?
whichllm plan "llama 3 70b"
whichllm plan "Qwen2.5-72B" --quant Q