Baseline Enterprise RAG, From PDF to Highlighted Answer


[Enterprise Document Intelligence] [Vol. 1 #1] The smallest version of RAG that actually works, on a real PDF, with grounded answers and the source lines highlighted.

The fastest way to understand RAG is to build the smallest version that actually works, run it on a real document, and look closely at what just happened. That is what this article does: about a hundred lines of Python (no vector database, no framework, no agents) run on the Attention Is All You Need paper and return a sourced answer with the exact source lines highlighted on the page.

Then we walk back through each block and ask the question it naturally raises. Each of those questions is developed in a later article. The minimal pipeline is the smallest amount of code that respects the four bricks and produces a verifiable answer. Every later article adds a capability that a team needs after hitting a specific failure on real documents, not because the architecture needed more layers.

1. What we’re building


The pipeline has four bricks (Part II goes into each one in detail) plus a final, optional rendering step. Each brick declares what it takes in and what it gives back; whatever passes from one brick to the next is what we save.

Document parsing takes a PDF path and returns line_df (one row per text line, with page_num, line_num, text, and the bounding box) plus page_df. The minimal version holds both in memory; bigger systems persist them.
Question parsing turns the user’s question into a ParsedQuestion carrying the normalized question plus a short list of checked keywords. It stays narrow on purpose: no retrieval logic here, no question embedding.
Retrieval consumes the ParsedQuestion and emits the top-k page numbers (and, when needed, the matching line numbers within those pages). Handing off only page numbers keeps the interface small; the next step rebuilds the filtered lines from line_df on the spot.
Generation brings together the question, line_df, and the retrieved page numbers, and produces an AnswerWithEvidence: a typed JSON object carrying the answer, the evidence span, a confidence, a justification, the exact quotes from the source, and any caveats.
PDF annotation is optional. Given the source PDF and the evidence span, it writes an annotated PDF with rectangles drawn around the cited lines.
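To make the handoffs above concrete, here is a minimal sketch of the two typed objects the bricks exchange. It is an illustration under assumptions, not the article's actual schema: the field names (evidence, first_line, last_line, quotes, caveats) and the EvidenceSpan helper are hypothetical, and standard-library dataclasses stand in for pydantic so the sketch has no third-party dependencies. The values in the example are placeholders, not output from a real run.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class ParsedQuestion:
    """Handoff from question parsing to retrieval (field names are assumptions)."""
    question: str                                      # normalized question text
    keywords: list[str] = field(default_factory=list)  # short, checked keyword list


@dataclass
class EvidenceSpan:
    """Hypothetical helper: where the cited lines sit in the document."""
    page_num: int
    first_line: int
    last_line: int


@dataclass
class AnswerWithEvidence:
    """Handoff from generation to the reader and to the optional PDF annotation."""
    answer: str
    evidence: EvidenceSpan
    confidence: float
    justification: str
    quotes: list[str]
    caveats: list[str] = field(default_factory=list)


# Placeholder values only; a real instance would be filled from the LLM response.
example = AnswerWithEvidence(
    answer="(answer text)",
    evidence=EvidenceSpan(page_num=3, first_line=12, last_line=14),
    confidence=0.9,
    justification="(why the cited lines support the answer)",
    quotes=["(verbatim line copied from the source)"],
)

# asdict() recurses into the nested dataclass, so the result serializes to JSON.
payload = json.dumps(asdict(example), indent=2)
print(payload)
```

Keeping the evidence as page and line numbers, rather than as free text, is what lets the annotation step look up bounding boxes in line_df and draw rectangles without asking the model again.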

The dependencies are minimal: pymupdf parses PDFs; openai is the LLM client; pandas holds the document as a DataFrame; pydantic defines the answer schema. No vector database, no orchestration framework, no specialized RAG library.

“For a 15-page paper, the LLM can read the whole thing. Why bother with retrieval?” Fair point for this one document. We use the paper to teach the method, not to save tokens on these 15 pages. The objection often points to the Needle in a Haystack benchmark, where frontier models score near-perfectly. That benchmark reflects a research setting, not enterprise practice: a needle is one isolated, verbatim fact, while enterprise questions aggregate, compare, or summarize across many passages.

Two more practical reasons keep retrieval in the loop. Enterprise documents are often long: a 300-page insurance contract, a 500-page regulatory filing. Sending the whole thing to the LLM costs real money and dilutes its attention. And the same question runs across hundreds or thousands of documents at once. At that scale, “throw it all in” stops being a strategy.
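A back-of-the-envelope calculation shows how quickly the gap opens. The numbers below are assumptions chosen for illustration, not measurements: roughly 500 tokens per page, 1,000 documents of 300 pages each, and a retrieval step that forwards 3 pages per document.

```python
def prompt_tokens(pages_sent: int, tokens_per_page: int, n_documents: int) -> int:
    """Total prompt tokens for asking one question across n_documents."""
    return pages_sent * tokens_per_page * n_documents


TOKENS_PER_PAGE = 500   # assumption: rough average for dense text
N_DOCUMENTS = 1_000     # assumption: one question run over a document set
PAGES_PER_DOC = 300     # the 300-page contract from the text
TOP_K_PAGES = 3         # assumption: pages kept by retrieval

full = prompt_tokens(PAGES_PER_DOC, TOKENS_PER_PAGE, N_DOCUMENTS)
retrieved = prompt_tokens(TOP_K_PAGES, TOKENS_PER_PAGE, N_DOCUMENTS)

print(f"whole documents : {full:,} tokens")
print(f"top-k pages only: {retrieved:,} tokens")
print(f"reduction factor: {full // retrieved}x")
```

Under these assumptions the prompt volume differs by two orders of magnitude for a single question; the exact factor depends on the page count and on how many pages retrieval keeps.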

2. The four bricks, and a PDF highlight


Each step declares its inputs and outputs, and the steps are independent. The output of step N is the input of step N+1.
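The chaining can be sketched with plain functions, each taking only data and returning only data. This is a toy illustration, not the article's code: parse_document is a stub that returns invented rows instead of reading a PDF, plain dicts stand in for line_df, the stopword filter is a simplistic stand-in for keyword checking, and generation is left out because it needs an LLM call. All function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

STOPWORDS = {"what", "is", "the", "a", "an", "of", "how", "does", "in"}


@dataclass
class ParsedQuestion:
    question: str
    keywords: list


def parse_document(pdf_path):
    """Stub for step 1: invented toy rows, one dict per text line."""
    return [
        {"page_num": 1, "line_num": 1, "text": "The Transformer relies on attention."},
        {"page_num": 2, "line_num": 1, "text": "Recurrent models process tokens one at a time."},
        {"page_num": 3, "line_num": 1, "text": "Multi-head attention runs several attention layers in parallel."},
        {"page_num": 3, "line_num": 2, "text": "Each head uses scaled dot-product attention."},
    ]


def parse_question(raw):
    """Step 2: normalize whitespace and case, keep non-stopword terms."""
    question = " ".join(raw.lower().split())
    words = [w.strip("?.,!") for w in question.split()]
    return ParsedQuestion(question, [w for w in words if w and w not in STOPWORDS])


def retrieve(parsed, line_rows, top_k=1):
    """Step 3: score pages by keyword occurrences, return top-k page numbers."""
    hits = Counter()
    for row in line_rows:
        text = row["text"].lower()
        score = sum(text.count(kw) for kw in parsed.keywords)
        if score:
            hits[row["page_num"]] += score
    return [page for page, _ in hits.most_common(top_k)]


def select_lines(line_rows, pages):
    """Rebuild the filtered lines from the page numbers alone."""
    wanted = set(pages)
    return [row for row in line_rows if row["page_num"] in wanted]


# Each output feeds the next step; every intermediate value can be saved or inspected.
line_rows = parse_document("paper.pdf")
parsed = parse_question("What is  multi-head attention?")
pages = retrieve(parsed, line_rows)
context = select_lines(line_rows, pages)
print(parsed.keywords, pages, [row["text"] for row in context])
```

Because every step is a function of its inputs alone, any one of them can be replaced or tested in isolation, for example by feeding retrieve a hand-written ParsedQuestion.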