LLM Wikis Are Over-Engineered — I Replaced Mine With a Pure Python Compiler
Structuring local markdown doesn’t need agents. It needs a compiler.
TL;DR: I built a pure Python pipeline that compiles a folder of raw, messy text notes into a linked, linted markdown wiki. No LLM calls, no embeddings, no external APIs, standard library only.
The pipeline has four stages: a regex extractor, a graph builder that detects cross-references, a section-aware rewriter that preserves anything you write by hand, and a linter that checks the pipeline’s own output.
I hit two real bugs while building this: a graph builder that scaled badly, and a linter that silently undercounted orphan pages. Both are in this article as they actually happened, along with the fixes.
I benchmarked the full pipeline at three corpus sizes on two different machines (Linux and Windows) and checked whether the deterministic outputs actually matched across both. They did, exactly. Full code, all 17 tests, and unrounded terminal output are included below so you can rerun everything yourself.
Why I wrote this
I tried building a Karpathy-style LLM wiki. Agent loops. Recursive LLM calls. Embeddings for everything. The input was a folder of local markdown files I already had, sitting on my own disk. And partway through, it hit me: I was paying tokens to reorganize text I already owned.
So I replaced the entire pipeline with a pure Python compiler. This article walks through that system in full: it turns a folder of raw, inconsistently formatted text notes into a linked, linted markdown wiki, with zero LLM calls, zero external APIs, and zero third-party dependencies.
The compiler mindset
Here’s the reframe that the rest of this article is built on: An agent decides what your wiki might look like. A compiler guarantees what it must look like.
I wanted this wiki to be predictable. Unlike an LLM, whose output can vary from run to run, a compiler gives you the same result every single time you run it on the same input. That consistency is essential for my personal reference notes.
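One way to make "same result every time" checkable is to hash the compiled output and compare digests between runs. The sketch below is an illustration only, not the article's code: `render` is a hypothetical stand-in for a compile step, and `digest` is an assumed helper name.

```python
import hashlib


def render(notes):
    # Hypothetical stand-in for a compile step. Iterating in sorted order
    # keeps the output independent of the order the notes were loaded in.
    return {name: f"# {name}\n\n{body.strip()}\n" for name, body in sorted(notes.items())}


def digest(pages):
    # Hash page names and contents in a fixed order so that identical
    # output always produces an identical digest.
    h = hashlib.sha256()
    for name in sorted(pages):
        h.update(name.encode("utf-8"))
        h.update(b"\0")
        h.update(pages[name].encode("utf-8"))
        h.update(b"\0")
    return h.hexdigest()


notes = {"b": "second note  ", "a": "first note"}
first = digest(render(notes))
# Same notes, loaded in the opposite order.
second = digest(render(dict(reversed(list(notes.items())))))
assert first == second
```

Comparing a single hex digest is also a cheap way to confirm that two machines produced byte-identical pages, provided both write the same encoding and line endings.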
Why zero dependencies matters here, specifically
Everything in this codebase runs on the Python standard library alone. No sentence-transformers, no vector database, no HTTP client for an embedding API. That’s not a purity test for its own sake. It’s a direct consequence of the problem this pipeline solves.
Once you strip away the LLM calls, what’s actually left to do is text parsing, string manipulation, and graph traversal over an in-memory dictionary. Those are exactly the kinds of problems that the re and os modules and plain Python data structures were built for.
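To make that concrete, here is a minimal sketch of regex extraction plus a cross-reference graph held in a plain dictionary. It is not the article's implementation: the note format (a leading "# " line as the title), the function names, and the sample notes are all assumptions chosen for illustration.

```python
import re

# Assumed note format: the first "# " line, if any, is the page title.
TITLE_RE = re.compile(r"^#\s+(.+)$", re.MULTILINE)


def extract_title(text, fallback):
    m = TITLE_RE.search(text)
    return m.group(1).strip() if m else fallback


def build_graph(notes):
    """Map each page title to the sorted list of other titles its text mentions."""
    titles = {name: extract_title(text, name) for name, text in notes.items()}
    graph = {}
    for name, text in notes.items():
        src = titles[name]
        body = text.lower()
        links = set()
        for other in titles.values():
            # Whole-word, case-insensitive match on the other page's title.
            pattern = r"\b" + re.escape(other.lower()) + r"\b"
            if other != src and re.search(pattern, body):
                links.add(other)
        graph[src] = sorted(links)
    return graph


def orphans(graph):
    # Pages that no other page links to.
    inbound = {target for targets in graph.values() for target in targets}
    return sorted(title for title in graph if title not in inbound)


notes = {
    "a.txt": "# Parsing\nTokenizers feed the linker stage.",
    "b.txt": "# Linker\nRuns after parsing and resolves references.",
    "c.txt": "no heading here, mentions nothing",
}
graph = build_graph(notes)
print(graph)
print(orphans(graph))
```

This naive version scans every note once per title, so its cost grows quadratically with the number of notes. That is fine for a toy corpus, but it is the kind of loop worth measuring before pointing it at a large folder.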
The problem with agent-driven wikis
The idea of using an LLM to build and maintain a personal wiki isn’t new, and it isn’t mine. It gained serious traction after Andrej Karpathy described the pattern in a widely shared post… I think that compilation framing is exactly right. I just don’t think an LLM needs to be the compiler.
Here’s the practical problem. If your raw source is already local, already text, and already deterministic, routing it through a probabilistic system to organize it introduces three costs that a parser or a compiler simply doesn’t have:
- Cost: Every time you add a new document, an agent-driven wiki re-reads content, decides what changed, and rewrites pages. That’s token spend on organizational work, not synthesis.
- Latency: Every read-decide-write cycle is a network round trip if you’re using a hosted model, and a real compute cost even if you’re running something local.
- Non-determinism: This is the one that actually bit me. I ran the same folder through…