I Tested CodeGraph on Hono. The Tool-Call Savings Reproduce — the Cost Savings Don't.
Two weeks ago CodeGraph hit GitHub trending: tree-sitter + SQLite/FTS5 + MCP for Claude Code, 19k+ stars in a week. The team published a benchmark on 7 repos showing 35% lower cost, 57% fewer tokens, 46% faster runs, and 71% fewer tool calls vs. baseline. Those are big numbers. They’re also numbers from a benchmark designed by the team that built the tool, on repos they chose. Designer bias is the #1 risk in any retrieval benchmark: when you pick the test repos and write the ground truth, you’ll consciously or unconsciously favor your own tool’s strengths.
So I ran an independent test on an 8th repo — Hono (TypeScript, ~280 source files, in neither CodeGraph’s published 7-repo suite nor any other published benchmark I could find). 5 architectural questions covering different retrieval shapes, with a deliberate control case (Q5) where the tool should not win. Two conditions (baseline grep+Read+Glob+Explore vs. CodeGraph active), 4 repeats per question per condition. 40 runs on Claude Opus 4.8 — and, critically, every CodeGraph run was verified to have connected, and actual codegraph_* tool usage was recorded per run (more on why that sentence exists below).
The result splits in a way the single published headline number hides, and the split is the useful part.
tl;dr: On Hono, CodeGraph delivers a large, consistent reduction in tool calls (-55%, 14.0 → 6.3 avg) and a smaller latency win (-20%), so the direction of the published 7-repo result reproduces here. But cost is a wash: +6.8%, not the published −35%. On narrow-scope questions (route lookup, middleware trace) CodeGraph is actually 20-43% more expensive, because each structural lookup loads a big chunk of graph context that costs more in cached tokens than the grep round-trips it replaces.
The cost win only appears on broad multi-file navigation (Q3 multi-runtime adapters: −29% cost, −80% tool calls, −53% latency). A second finding: baseline grep+Read has high variance. The agent occasionally spiraled to 47-52 tool calls on the broad questions, while CodeGraph never exceeded 16. Net at Hono’s size: CodeGraph makes the agent take fewer steps and finish faster, but not for fewer dollars. Total cost of the 40 valid runs: ~$14 of Opus 4.8 calls. Raw per-run CSV and the 5 verbatim prompts are below.
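The headline deltas are plain percent changes between per-condition means, and the variance finding is a per-condition maximum. A minimal sketch of that aggregation follows, assuming hypothetical record fields (`condition`, `tool_calls`) and made-up sample rows rather than the real per-run CSV:

```python
from statistics import mean


def pct_change(baseline: float, treatment: float) -> float:
    """Percent change of treatment relative to baseline (negative means a reduction)."""
    return (treatment - baseline) / baseline * 100.0


def summarize(runs, metric):
    """Return {condition: {"mean": ..., "max": ...}} for one metric."""
    grouped = {}
    for run in runs:
        grouped.setdefault(run["condition"], []).append(run[metric])
    return {cond: {"mean": mean(vals), "max": max(vals)} for cond, vals in grouped.items()}


# Illustrative rows only (hypothetical values, not the measured per-run data).
sample_runs = [
    {"condition": "baseline", "tool_calls": 9},
    {"condition": "baseline", "tool_calls": 12},
    {"condition": "baseline", "tool_calls": 47},  # one "spiral" run dominates the mean
    {"condition": "baseline", "tool_calls": 12},
    {"condition": "codegraph", "tool_calls": 5},
    {"condition": "codegraph", "tool_calls": 6},
    {"condition": "codegraph", "tool_calls": 8},
    {"condition": "codegraph", "tool_calls": 5},
]

summary = summarize(sample_runs, "tool_calls")
sample_delta = pct_change(summary["baseline"]["mean"], summary["codegraph"]["mean"])
print(summary, round(sample_delta, 1))

# The post's averaged tool-call figures (14.0 -> 6.3) run through the same formula.
headline_delta = pct_change(14.0, 6.3)
print(round(headline_delta))
```

Reporting the maximum next to the mean matters here because a single spiral run can move a 4-run mean a long way.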
What “tool calls down, cost flat” actually means
CodeGraph’s published 7-repo suite (VS Code, Excalidraw, Django, Tokio, OkHttp, Gin, Alamofire) skews larger and more architecturally complex than Hono. Hono is ~280 TypeScript source files (362 files indexed by CodeGraph, including tests and configs), 16MB on disk. That is small enough for a thoughtful agent with grep + Read to finish most architectural questions in a handful of tool calls. The interesting result is that the axes come apart. CodeGraph replaces several grep+Read round-trips with one or two structural lookups, so step count drops hard (-55%).
But each codegraph_context / codegraph_explore call returns a sizeable chunk of graph context, which then rides along in the conversation cache and gets re-read every turn. At Hono’s size, the dollar cost of carrying that cached payload roughly equals the dollar cost of the grep round-trips it replaced — so dollars stay flat (+7%) even as steps fall by more than half. That’s not a contradiction of the cost-curve thesis from the prior post in this mini-series — it’s a sharper reading of it. Hono sits above the step-count crossover (the index already saves tool calls) but below the dollar crossover (it doesn’t yet save money). On a much bigger repo, the grep path churns through far more files and the index pays back on dollars too. Hono just happens to land in the gap between the two crossovers.
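The "payload rides along in the cache" argument can be made concrete with a toy cost model. The sketch below assumes each turn re-reads everything already in the conversation at a cache-read price and adds its own tool output at a cache-write price. All token counts and the relative prices are hypothetical, chosen only to show how fewer-but-larger turns can land at roughly the same total; they are not measured values or real pricing.

```python
def session_cost(turns: int, tokens_per_turn: int,
                 write_price: float, read_price: float) -> float:
    """Toy model: every turn re-reads the cached context, then appends new tool output."""
    cost = 0.0
    cached_tokens = 0
    for _ in range(turns):
        cost += cached_tokens * read_price      # re-read what earlier turns left behind
        cost += tokens_per_turn * write_price   # add this turn's tool output to the cache
        cached_tokens += tokens_per_turn
    return cost


# Hypothetical relative prices: a cache read costs one tenth of a cache write.
WRITE_PRICE = 1.0
READ_PRICE = 0.1

# Hypothetical shapes: many small grep/Read results vs. a few large graph payloads.
grep_cost = session_cost(turns=14, tokens_per_turn=2_000,
                         write_price=WRITE_PRICE, read_price=READ_PRICE)
graph_cost = session_cost(turns=6, tokens_per_turn=6_000,
                          write_price=WRITE_PRICE, read_price=READ_PRICE)

relative = (graph_cost - grep_cost) / grep_cost * 100.0
print(f"grep path: {grep_cost:.0f}, graph path: {graph_cost:.0f}, delta: {relative:+.1f}%")
```

In this model the graph path wins on dollars once the grep path needs enough extra turns, or the per-lookup payload shrinks enough, which is the "two crossovers" picture in miniature.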
A useful complementary benchmark answers three things the published one doesn’t:
- Cross-validation on a repo not chosen by the tool’s team — do the published advantages generalize?
- Within-repo variance across question types — does the win concentrate on certain question shapes? (It does — heavily.)
- A control case where the tool shouldn’t win: Q5 (text search) tests whether the agent correctly declines to use the structural engine when grep is the right tool.
Setup — install CodeGraph, ~10 minutes
    # install (downloads a single binary, no Node/npm required)
    curl -fsSL https://raw.githubusercontent.com/colbymchenry/codegraph/main/install.sh | sh
    # clone the test repo + index it
    git clone https://github.com/honojs/hono.git ~/tmp/hono
    cd ~/tmp/hono
    codegraph init -i
Index build time on Hono (362 files, 4,128 nodes, 8,225 edges): 1.7 seconds. On-disk index: 7.1 MB.
Per-condition setup for the two arms:
Baseline (control): a clean copy of Hono via rsync -a --exclude='.codegraph/' to a separate directory so Claude couldn’t accidentally grep into the index. No MCP servers registered. Agent uses native Glob + Grep + Read + Explore + Task.
CodeGraph active: original Hono directory with .codegraph/ present, MCP server registered: {"mcpServers": {"codegraph": {"command": "codegraph", "args": ["serve", "--mcp"]}}}
Both arms run claude --print --output-format stream-json --model opus so the model and the rest of the agent loop are identical; the only varying input is whether the CodeGraph MCP server is in the loop. Each run is a fresh session with no prior context.
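The design works out to 5 questions × 2 conditions × 4 repeats = 40 runs. A minimal sketch of how that run matrix could be enumerated is below. It only builds command lines and does not execute anything. The directory names, the `codegraph-mcp.json` file name, the placeholder question labels, and passing the prompt as the final positional argument are assumptions for illustration, and applying `--mcp-config` / `--strict-mcp-config` only to the CodeGraph arm is one plausible reading of the setup, not the author's confirmed harness.

```python
from itertools import product

QUESTIONS = ["Q1", "Q2", "Q3", "Q4", "Q5"]  # placeholders for the 5 verbatim prompts
CONDITIONS = ["baseline", "codegraph"]
REPEATS = 4

# Shared invocation taken from the setup description.
BASE_CMD = ["claude", "--print", "--output-format", "stream-json", "--model", "opus"]

# Hypothetical paths: clean rsync copy for the control arm, indexed repo for the other.
WORKDIRS = {"baseline": "~/tmp/hono-baseline", "codegraph": "~/tmp/hono"}
MCP_CONFIG = "codegraph-mcp.json"  # hypothetical file holding the mcpServers JSON


def build_runs():
    """Enumerate every (question, condition, repeat) cell with its command line."""
    runs = []
    for question, condition, repeat in product(QUESTIONS, CONDITIONS, range(1, REPEATS + 1)):
        cmd = list(BASE_CMD)
        if condition == "codegraph":
            # Load only the server under test, ignoring globally registered servers.
            cmd += ["--mcp-config", MCP_CONFIG, "--strict-mcp-config"]
        cmd.append(question)  # assumption: prompt passed as the last positional argument
        runs.append({
            "question": question,
            "condition": condition,
            "repeat": repeat,
            "cwd": WORKDIRS[condition],
            "cmd": cmd,
        })
    return runs


runs = build_runs()
print(len(runs), "runs;", sum(r["condition"] == "codegraph" for r in runs), "with CodeGraph")
```

Keeping the two arms in one enumeration makes it harder for the conditions to drift apart in anything other than the MCP flags and the working directory.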
Verifying the tool actually ran (this is not optional)
A retrieval-tool benchmark is only valid if the tool is actually in the loop, and I learned that the hard way. My first pass at this benchmark silently ran with CodeGraph’s MCP server never connected: the config was missing the --mcp flag, and Claude Code proceeds without a server that fails to complete its handshake in time rather than erroring out. Every “CodeGraph” run was really just grep+Read. The comparison was noise, and the numbers looked plausibly small, which is exactly how a broken benchmark slips through. So for the data here, every run is instrumented: --strict-mcp-config (only the server under test is loaded, with no contamination from other globally-registered MCP servers).
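A per-run check of this kind can be sketched as a small parser over the stream-json output. The event shape used here is an assumption about that output, not something the post confirms: an initial `system`/`init` event carrying an `mcp_servers` list with `name` and `status`, and `assistant` events whose content blocks include `tool_use` entries. The tool name `mcp__codegraph__codegraph_context` in the sample is likewise illustrative.

```python
import json


def inspect_run(stream_lines, server="codegraph", tool_marker="codegraph_"):
    """Report whether the MCP server connected and how often its tools were called."""
    connected = False
    tool_calls = 0
    for line in stream_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON event
        if event.get("type") == "system" and event.get("subtype") == "init":
            # Assumed field: list of {"name": ..., "status": ...} entries.
            for srv in event.get("mcp_servers", []):
                if srv.get("name") == server and srv.get("status") == "connected":
                    connected = True
        elif event.get("type") == "assistant":
            content = event.get("message", {}).get("content", [])
            if not isinstance(content, list):
                continue
            for block in content:
                if (isinstance(block, dict)
                        and block.get("type") == "tool_use"
                        and tool_marker in block.get("name", "")):
                    tool_calls += 1
    return {"connected": connected, "codegraph_tool_calls": tool_calls}


# Synthetic events for illustration only (not captured output).
sample_stream = [
    json.dumps({"type": "system", "subtype": "init",
                "mcp_servers": [{"name": "codegraph", "status": "connected"}]}),
    json.dumps({"type": "assistant", "message": {"content": [
        {"type": "tool_use", "name": "mcp__codegraph__codegraph_context", "input": {}},
        {"type": "tool_use", "name": "Grep", "input": {}},
    ]}}),
]
result = inspect_run(sample_stream)
print(result)
```

A run in the CodeGraph arm that reports `connected: False` would be discarded rather than counted, which is the failure the first pass missed.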