When PyMuPDF Can’t See the Table: Parse PDFs for RAG with Azure Layout

Enterprise Document Intelligence [Vol.1 #5bis]: The same relational tables. Native table cells. OCR for scanned pages and images. Captions and headings without regex.

Kezhan Shi, Jun 12, 2026, 16 min read

Photo by Amsterdam City Archives, via Unsplash.

This article is a parsing companion in Enterprise Document Intelligence, the series that builds an enterprise RAG system from four bricks. Article 5 (document parsing) built the parser with PyMuPDF (fitz). This companion keeps the same goal and the same relational tables, and swaps the engine for Azure Layout (the prebuilt-layout model), a richer engine that recovers what fitz cannot. That gap is where we start.

PyMuPDF (fitz) is fast, free, and exact on clean prose. It also goes blind in three places, and each one is where enterprise RAG quietly breaks.

The table on page 14 of a contract. Fitz reads the cells one by one and concatenates them. The column structure is gone. “Renewal fee 500 Setup fee 200” lands in the chunk. The model has to guess which number belongs to which fee.

The scanned amendment glued to the end of the document. Fitz reads the native pages and returns empty strings on the scanned ones. The user gets no answer on the amendment because the parser never read it.

The figure with text inside. A chart with axis labels. A signed seal stamp. A screenshot of a spreadsheet. Fitz returns the bbox of the image. The text inside is lost.

Azure Document Intelligence reads all three. It’s a proprietary Microsoft Azure cloud service governed by Microsoft’s Online Services Terms. The prebuilt-layout model returns native table cells (rows, columns, headers), OCR text for every page (native or scanned), figures with the text inside them, and paragraph roles (title, sectionHeading, figureCaption, tableCaption). All of it comes back from one call.

1. Where fitz is blind

1.1. Tables: fitz returns flat words, Azure returns cells.

A contract table has rows and columns. The label “Renewal fee” sits in column 1, the value 500 sits in column 2. Fitz reads the page top to bottom and emits one line per text segment. The four cells of a row come back as four loose fragments. Sometimes cells from the row below get mixed in when the y-coordinates are close. The chunker downstream sees a soup of words. The row-and-column structure that makes a table a table is gone.

Azure’s prebuilt-layout model detects each table as a structured object. result.tables is a list of tables, each with cells indexed by (row_index, column_index). Header cells are flagged (cell.kind == "columnHeader"). The cell content is the text of that cell, kept separate from its neighbors. We flatten the table into markdown rows so it lives inside line_df like any other content.
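The flattening step above can be sketched as follows. This is a minimal illustration, not the article's actual implementation: `Cell` is a hypothetical stand-in for an Azure table cell (it only mirrors the row_index, column_index, content and kind fields named above), `table_to_markdown` is an assumed helper name, and spanning cells are ignored for brevity.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    # Hypothetical stand-in for one cell of an Azure layout table.
    row_index: int
    column_index: int
    content: str
    kind: str = "content"  # "columnHeader" marks header cells


def table_to_markdown(cells, row_count, column_count):
    """Flatten (row_index, column_index)-indexed cells into markdown rows."""
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    header_rows = set()
    for cell in cells:
        # Escape pipes so cell text cannot break the markdown row.
        grid[cell.row_index][cell.column_index] = cell.content.replace("|", "\\|")
        if cell.kind == "columnHeader":
            header_rows.add(cell.row_index)

    lines = []
    for r, row in enumerate(grid):
        lines.append("| " + " | ".join(row) + " |")
        # Emit the markdown separator right after the last header row.
        if header_rows and r == max(header_rows):
            lines.append("|" + " --- |" * column_count)
    return lines


# Made-up sample data echoing the contract example.
cells = [
    Cell(0, 0, "Item", "columnHeader"),
    Cell(0, 1, "Amount", "columnHeader"),
    Cell(1, 0, "Renewal fee"),
    Cell(1, 1, "500"),
    Cell(2, 0, "Setup fee"),
    Cell(2, 1, "200"),
]
rows = table_to_markdown(cells, row_count=3, column_count=2)
```

Each returned string can become one row of line_df, so “Renewal fee” and “500” stay on the same line and the chunker never has to guess which number belongs to which label.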

1.2. Images: fitz returns the bbox, Azure returns the text.

Many PDFs have figures with text inside them. Architecture diagrams with box labels. Charts with axis ticks and legends. Signed seal stamps. Embedded screenshots of spreadsheets. Fitz returns each image as a bbox and the raw bytes. The text inside is invisible to the parser.

Azure’s OCR runs on every page, including the pixels inside figure regions. For each figure, we collect every word Azure returns whose bbox sits inside the figure region and join them as ocr_text. “Multi-Head Attention Concat Linear h” now lives in image_df.ocr_text for the figure on page 4 of the Attention paper. Retrieval can match a question about “multi-head attention” even when the answer is text inside a figure.
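The containment test described above might look like the sketch below. It assumes each word and each figure region has already been reduced to an axis-aligned (x0, y0, x1, y1) box on the same page and in the same units. `Word`, `bbox_inside`, `figure_ocr_text` and the `tol` parameter are hypothetical names introduced for illustration, and the coordinates are made up.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical stand-in for one OCR word with an axis-aligned box.
    text: str
    bbox: tuple  # (x0, y0, x1, y1)


def bbox_inside(inner, outer, tol=0.0):
    """True when `inner` lies fully within `outer`, with optional slack `tol`."""
    ix0, iy0, ix1, iy1 = inner
    ox0, oy0, ox1, oy1 = outer
    return (
        ix0 >= ox0 - tol
        and iy0 >= oy0 - tol
        and ix1 <= ox1 + tol
        and iy1 <= oy1 + tol
    )


def figure_ocr_text(words, figure_bbox, tol=0.0):
    """Join the words that fall inside the figure, keeping their input order."""
    return " ".join(w.text for w in words if bbox_inside(w.bbox, figure_bbox, tol))


# Made-up coordinates: three words inside the figure, one body-text word outside.
words = [
    Word("Multi-Head", (1.0, 1.0, 2.0, 1.2)),
    Word("Attention", (2.1, 1.0, 3.0, 1.2)),
    Word("Concat", (1.5, 1.5, 2.5, 1.7)),
    Word("The", (0.5, 5.0, 0.9, 5.2)),
]
ocr_text = figure_ocr_text(words, figure_bbox=(0.8, 0.8, 3.5, 2.0))
```

Keeping the input order relies on the words already arriving in reading order. A small `tol` can help when a word box pokes slightly past the figure boundary.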

1.3. Scanned pages: fitz returns nothing, Azure returns OCR.

A 30-page native contract gets a 10-page scanned amendment glued at the end. Fitz reads the native pages and returns empty strings for the scanned ones. The parser does not flag this. The downstream pipeline silently covers 30 of the 40 pages, or 75% of the document. The user has no idea the other 25% is missing.

Azure runs OCR on every page regardless of source. Native pages and scanned pages come back through the same result.pages[i].lines path with the same shape. The parsing_method column on line_df lets downstream code tell which engine produced which rows. The parsing_summary dict has an n_pages field that matches the document’s actual page count, not just the pages with native text.
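The silent gap described above can be made visible with a small coverage check. The sketch below assumes a plain list holding one extracted string per page (with PyMuPDF this could come from calling page.get_text() on each page). `coverage_report` and every key except n_pages are hypothetical names, not fields from the article's parsing_summary.

```python
def coverage_report(page_texts):
    """Summarize how many pages actually produced text."""
    n_pages = len(page_texts)
    # 1-based page numbers whose extracted text is empty or whitespace only.
    empty_pages = [i + 1 for i, text in enumerate(page_texts) if not text.strip()]
    covered = n_pages - len(empty_pages)
    return {
        "n_pages": n_pages,
        "n_pages_with_text": covered,
        "empty_pages": empty_pages,
        "coverage": covered / n_pages if n_pages else 0.0,
    }


# The example from the text: 30 native pages followed by 10 scanned pages.
page_texts = ["Clause text"] * 30 + [""] * 10
report = coverage_report(page_texts)
```

A report like this is enough to route the empty pages to an OCR engine, or at least to warn that a quarter of the document never reached the index.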

1.4. Captions and headings: fitz uses regex, Azure has explicit roles.

The fitz parser detects figure and table captions with a regex on the start of each line (^Figure \d+\b, ^Table \d+\b). It works when captions look like “Figure 2” and misses the rest (“Fig. 2”, multi-line wraps). It also has false positives: a body-text sentence that starts with “Figure 2” gets picked up as a caption when it is only a mention.
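Both failure modes are easy to reproduce with Python's re module. The snippet below merges the two patterns quoted above into a single expression for brevity. The sample lines are invented for illustration and are not taken from any parsed document.

```python
import re

# The two patterns from the text, combined into one alternation.
CAPTION_RE = re.compile(r"^(Figure|Table) \d+\b")

lines = [
    "Figure 2: Scaled dot-product attention.",            # caption, matched
    "Fig. 2: Scaled dot-product attention.",              # caption, missed
    "Figure 2 shows the attention mechanism in detail.",  # body text, false positive
    "Table 1: Maximum path lengths.",                     # caption, matched
    "As shown in Figure 2, the heads run in parallel.",   # body text, ignored
]

# True where the regex would treat the line as a caption.
matches = [bool(CAPTION_RE.match(line)) for line in lines]
```

The second line is a real caption the regex never sees, and the third is a sentence it wrongly promotes to a caption. Widening the pattern to catch “Fig.” fixes the first problem while making the second one more likely.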

Azure’s paragraphs field has role labels: each paragraph in the result carries a tag like "figureCaption", "tableCaption", "title", or "sectionHeading" that tells us what kind of block it is, without any regex. "figureCaption" and "tableCaption" populate object_registry directly.
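Routing by role can be sketched as below. The paragraphs are modeled as plain dicts with made-up content and page numbers, and the shape of each registry entry (kind, caption, page) is an assumption for illustration. The article's real object_registry schema is not shown in this section.

```python
# Hypothetical mapping from paragraph role to registry object kind.
ROLE_TO_KIND = {"figureCaption": "figure", "tableCaption": "table"}


def build_object_registry(paragraphs):
    """Keep only caption paragraphs, tagged with the object kind they describe."""
    registry = []
    for para in paragraphs:
        kind = ROLE_TO_KIND.get(para.get("role"))
        if kind is None:
            continue  # titles, headings and body text are handled elsewhere
        registry.append(
            {"kind": kind, "caption": para["content"], "page": para["page"]}
        )
    return registry


# Made-up sample paragraphs; body text carries no role.
paragraphs = [
    {"role": "title", "content": "Attention Is All You Need", "page": 1},
    {"role": "sectionHeading", "content": "3.2 Attention", "page": 3},
    {"role": None, "content": "Figure 2 shows the attention mechanism.", "page": 4},
    {"role": "figureCaption", "content": "Figure 2: Multi-head attention.", "page": 4},
    {"role": "tableCaption", "content": "Table 1: Complexity per layer.", "page": 6},
]
object_registry = build_object_registry(paragraphs)
```

The body-text sentence that starts with “Figure 2” has no caption role, so it never enters the registry. That is the false positive the regex approach could not avoid.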