Reconstructing the Table of Contents a PDF Forgot to Ship, So RAG Can Scope by Section

Enterprise Document Intelligence [Vol.1 #5septies] – When a PDF prints a contents page but exposes no outline, two ways to turn it back into structure, plus the page-alignment step everyone forgets

Kezhan Shi | Jun 21, 2026 | 14 min read

This article is a document parsing companion in Enterprise Document Intelligence, the series that builds an enterprise RAG system from four bricks. It extends Article 5 (document parsing) on one table: toc_df, the document’s section structure, which Article 5 fills from the PDF’s native outline (PyMuPDF’s doc.get_toc()) when there is one. This part covers the case where there isn’t one: reconstructing that structure from what the document still shows on the page.

Open NIST FIPS 202, the SHA-3 standard, and turn to page seven. There is a clean table of contents: section titles on the left, page numbers on the right. Now open the same file in any PDF viewer and look at the bookmarks pane. Empty. The contents page is ink on a page, not structure the machine can use. The author wrote a perfectly good table of contents, and the file shipped without exposing it.

Article 5 (document parsing) and Article 5B (the relational data model) leaned on doc.get_toc(), the PDF’s native outline, to fill toc_df. It is exact when it exists. It often does not. Plenty of real documents (papers exported straight from LaTeX, contracts printed to PDF, government standards) carry a printed contents page but no outline. For those, toc_df comes back empty, even though the document is telling you its structure in plain sight on page seven.

That structure is not a nicety. Retrieval scopes by section (Article 7). The chunker cuts on heading boundaries (Article 5B). Summarization walks the document section by section. Every one of those steps reads toc_df. When it is empty, retrieval falls back to scanning every page, the chunker splits on blind page breaks, and the answer loses the document’s own structure.

So the question this article answers is narrow and practical: when the file ships no outline but prints a contents page, how do you turn that page back into a toc_df?

One thing up front, because it is easy to conflate. This is about documents that have a contents page. A document with no contents page at all (a paper that just opens with “1. Introduction”, a five-page memo, an export that stripped every heading) is a different problem. Recovering a skeleton from the body of an unstructured document is summarization, a separate intent that builds the map from the chunks rather than reading one off a page. Here we only ever read a contents page the document already has.

1. Two halves: read the entries, then find their real pages

It helps to separate two things a contents page gives you. The first is a list of sections with titles and a hierarchy: what the document is about, in what order. The second is a map from each section to where it physically starts in the file. The native outline hands you both for free. Reading a printed contents page hands you the first directly, but the second only as printed labels, which are not physical pages. The two halves have different failure modes, so the rest of this article keeps them separate: first read the entries, then align them to physical pages.

In: a PDF whose doc.get_toc() returns nothing but that prints a contents page. Out: a toc_df with the same shape Article 5B defined (level, title, start_page, end_page, breadcrumb), so everything downstream keeps working unchanged.
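As a rough sketch of that output shape, the snippet below turns a flat list of (level, title, start_page) entries into rows carrying the five columns named above. How Article 5B actually derives end_page and breadcrumb is not shown in this part, so both rules here are assumptions: a section ends on the page before the next entry at the same or a shallower level starts (or on the last page of the document), and the breadcrumb joins ancestor titles with " > ". The function name build_rows and the sample entries are hypothetical.

```python
def build_rows(entries, last_page):
    """entries: list of (level, title, start_page) tuples, in document order."""
    rows = []
    stack = []  # titles of the current ancestors, ending with the entry itself
    for i, (level, title, start_page) in enumerate(entries):
        # Keep only the ancestors shallower than this entry, then add the entry.
        stack = stack[: level - 1] + [title]
        # Assumed rule: the section runs until the next entry at the same
        # or a shallower level; the last such section runs to last_page.
        end_page = last_page
        for next_level, _, next_start in entries[i + 1:]:
            if next_level <= level:
                end_page = max(start_page, next_start - 1)
                break
        rows.append({
            "level": level,
            "title": title,
            "start_page": start_page,
            "end_page": end_page,
            "breadcrumb": " > ".join(stack),
        })
    return rows


# Made-up entries, for illustration only.
sample = [
    (1, "Introduction", 9),
    (2, "Purpose", 9),
    (2, "Scope", 10),
    (1, "Glossary", 12),
]
for row in build_rows(sample, last_page=20):
    print(row)
```

A list of dicts like this can be handed to pandas.DataFrame(rows) to get a table with those five columns.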

2. Three cases, by ascending cost

The cascade tries each case in turn and stops at the first that yields a usable TOC. Each case has a detection step and an extraction step, and falls through to the next when it fails or returns too little.

  • Case 1, native outline. Handled in Article 5 by build_toc_df. Free, exact, hierarchical. When it works there is nothing to do. We recap it only to set the cost baseline.
  • Case 2, contents page with links. No outline, but an early page lists titles as hyperlinks pointing inside the file. The link target is the physical page, so this case skips the alignment problem entirely.
  • Case 3, contents page without links. A page that looks like a printed contents (titles, dot leaders, right-aligned page numbers) but carries no links. The page numbers it prints are labels in the document’s own numbering, not physical pages, so this case needs the alignment step.

All of this lives in a module of its own, separate from the native path so Article 5 stays readable. The entry point is reconstruct_toc_df.
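The control flow of such a cascade can be sketched as a loop over readers ordered by cost. This only illustrates the "try, check, fall through" pattern and is not the module's actual code: the name run_cascade, the reader callables, and the min_entries threshold of 3 are all assumptions.

```python
def run_cascade(pdf_path, readers, min_entries=3):
    """Try each (name, reader) pair in order; return the first usable result.

    `readers` is ordered from cheapest to most expensive. Each reader takes
    the PDF path and returns a list of TOC entries (possibly empty).
    All names and the threshold are hypothetical.
    """
    for name, reader in readers:
        try:
            entries = reader(pdf_path) or []
        except Exception:
            # A reader that fails is treated like one that found nothing.
            entries = []
        if len(entries) >= min_entries:
            return name, entries
    # Every case failed or returned too little.
    return None, []


# Stand-in readers, for illustration only.
def no_outline(path):
    return []

def linked_contents(path):
    return [("Introduction", 3), ("Scope", 4), ("Glossary", 9)]

print(run_cascade("report.pdf", [("native", no_outline), ("links", linked_contents)]))
```

Returning the name of the case that succeeded alongside the entries makes it easy to log which path a given document took, and to know whether the alignment step is still needed.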

3. Follow the links

Case 2 is the lucky one. Some documents have no outline but do ship a clickable contents page. The NIST Cybersecurity Framework is one: page two lists every section as a hyperlink that jumps into the document. PyMuPDF exposes those links per page, and each internal link carries its target page directly.

In: the PDF (links are not in line_df, so this reader opens the file). Out: entries with a title and the physical target page, already resolved. The detection is a density check: a page with five or more internal links is a navigation page, not a body page with the odd footnote link. The extraction joins each link’s clickable rectangle back to the text under it, then strips the leaders and the trailing page label.
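The stripping step can be looked at in isolation. The helper below is a minimal standard-library sketch, not the article's own code: the name clean_link_title and the regular expression are assumptions, and it only handles arabic page labels. It also assumes each entry really ends with a printed page label, so a title that itself ends in a number and has no label after it would lose that number.

```python
import re

# Assumed pattern: a run of spaces, dots, or ellipsis characters,
# followed by arabic digits at the very end of the text.
_TRAILING_LABEL = re.compile(r"[\s.\u2026]+\d+\s*$")


def clean_link_title(text):
    """Strip dot leaders and the trailing page label from text under a link."""
    text = " ".join(text.split())       # collapse line breaks and runs of spaces
    text = _TRAILING_LABEL.sub("", text)
    return text.rstrip(" .\u2026")      # drop leaders left when there was no label


for raw in ["1.1 Purpose ........ 7", "Annex A . . . . 12", "Foreword"]:
    print(repr(clean_link_title(raw)))
```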

import fitz # PyMuPDF

def extract_toc_from_links(pdf_path, min_links=5):
    """The contents page is the page carrying the most internal links."""
    doc = fitz.open(pdf_path)
    best = []
    for page in doc:
        entries = []
        for link in page.get_links()