Parse PDFs for RAG Locally with Docling: Rich Tables, No Cloud Upload

Enterprise Document Intelligence [Vol.1 #5ter] – Table cells, OCR, captions, headings: cloud-grade structure, running on your own machine. No key, no per-page bill, nothing leaves the building.

This article is a parsing companion in Enterprise Document Intelligence, the series that builds an enterprise RAG system from four bricks. Article 5 (document parsing) built the parser with PyMuPDF (fitz). This companion keeps the same goal and the same relational tables, and swaps the engine for Docling, a richer package that recovers the table cells, scanned-page text (via OCR), and captions that fitz misses, and that runs entirely on your own machine. Why that last part matters is where we start.

The richest parser you can buy reads the table, the scan, and the text trapped inside a figure. It also needs the document handed to someone else’s cloud. For a lot of enterprise work that’s a non-starter: the insurance contract on your desk, the medical record, the M&A data room, the signed employment agreement. Legal will not let those bytes leave the building, never mind cross a border. The richest parser in the world is useless if compliance blocks the upload.

Docling is the other half of the answer. It’s IBM Research’s open-source document parser: layout detection, OCR, reading order, and TableFormer (IBM’s deep-learning model that detects table structure without regex). All of it comes as a pip install, and it runs on your own machine. The first call downloads the models to a local cache; every call after that is offline. No API key, no per-page charge, and the document never leaves the host. The output is the same set of relational tables that the fitz and Azure parsers produce. The downstream pipeline does not care which engine produced the dict. Retrieval, generation, and annotation read rows. They never read the PDF.

1. The cloud is the constraint, not the capability

Article 5 bis made the case for richer parsing. Tables that keep their columns. OCR on scanned pages. Text recovered from inside figures. Headings even when the PDF has no bookmarks. None of that argument changes here. What changes is where the computation happens. Azure DI is a managed cloud service. You send it bytes, it sends back structure. For a public arXiv paper that’s fine. For the documents that fill a real enterprise archive it often isn’t:

  • Confidentiality: Insurance policies, health records, contracts under NDA, anything with personal data. Sending them to a third-party API is a data-processing event that legal has to sign off on, and frequently won’t.
  • Residency: “The data stays in this region” is a contractual term in a lot of industries. A cloud parser in the wrong region breaks it.
  • Air-gapped environments: Some networks have no outbound internet at all. A cloud call is not slow there, it’s impossible.
  • Cost at scale: A few cents per page is nothing for a thousand pages and a real line item for ten million.

Docling answers all four the same way: the model runs where the document already is. The tradeoff moves from money and trust to compute and setup. You pay in CPU seconds and a one-time model download instead of per-page fees and a compliance review. For a confidential corpus that’s the trade you want.

2. Same contract, run locally

One call returns the same tables as the fitz parser, in the same shape, all from one local Docling conversion. The Docling SDK call itself is short: build a DocumentConverter, hand it a path, read back a DoclingDocument. The first call downloads the layout and TableFormer weights to a local cache; every call after that is offline.

from docling.document_converter import DocumentConverter
converter = DocumentConverter() # lazy: loads no model yet
result = converter.convert("data/paper/1706.03762v7.pdf")
doc = result.document # a DoclingDocument

# what one DoclingDocument exposes
doc.export_to_markdown() # full document as markdown
doc.tables # TableItem list (each carries .data.table_cells)
doc.pictures # PictureItem list (bbox + optional ocr / classification)
doc.texts # TextItem list, labelled title / section_header / paragraph / formula / caption
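
The cell list on each TableItem can be turned back into a grid in a few lines. The sketch below is an illustration, not the article's builder: cells_to_grid is a hypothetical helper, and the demo uses SimpleNamespace stand-ins with placeholder values instead of a real conversion, so it runs without Docling installed. It assumes only the start_row_offset_idx, start_col_offset_idx, and text fields of Docling's table cells, and it writes a spanning cell's text once, at its top-left position.

```python
from types import SimpleNamespace


def cells_to_grid(table_cells):
    """Place each cell's text at its (row, col) start offset in a 2D list.

    Hypothetical helper: works on any objects exposing
    start_row_offset_idx, start_col_offset_idx and text.
    """
    cells = list(table_cells)
    if not cells:
        return []
    # Grid size is derived from the largest start offsets seen.
    n_rows = max(c.start_row_offset_idx for c in cells) + 1
    n_cols = max(c.start_col_offset_idx for c in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c.start_row_offset_idx][c.start_col_offset_idx] = c.text
    return grid


# Stand-in cells with placeholder values (not taken from any real PDF).
demo_cells = [
    SimpleNamespace(start_row_offset_idx=0, start_col_offset_idx=0, text="name"),
    SimpleNamespace(start_row_offset_idx=0, start_col_offset_idx=1, text="score"),
    SimpleNamespace(start_row_offset_idx=1, start_col_offset_idx=0, text="x"),
    SimpleNamespace(start_row_offset_idx=1, start_col_offset_idx=1, text="1"),
]
demo_grid = cells_to_grid(demo_cells)
```

With a real conversion, the equivalent call would be cells_to_grid(doc.tables[0].data.table_cells), provided the document contains at least one table.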

That DoclingDocument is what every builder in this article reads. parse_pdf_docling wraps the call above and turns the document into the same dict of tables every other engine returns, so downstream bricks read the output without knowing which engine ran.
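
To make the "dict of tables" idea concrete, here is a minimal sketch of what such a wrapper could look like. It is not the article's parse_pdf_docling: the function names, the table names ("texts", "table_cells"), and the column names are all assumptions chosen for illustration, and the real schema may differ. The flattening step is written against duck-typed objects so it can be exercised with stand-ins, and Docling is imported only inside the wrapper.

```python
from types import SimpleNamespace


def document_to_tables(doc, doc_id):
    """Flatten a DoclingDocument-like object into a dict of row lists.

    Table and column names here are hypothetical, not the series' schema.
    """
    text_rows = []
    for i, item in enumerate(doc.texts):
        # label may be an enum (real Docling) or a plain string (stand-in).
        label = getattr(item.label, "value", item.label)
        page = item.prov[0].page_no if item.prov else None
        text_rows.append(
            {"doc_id": doc_id, "text_id": i, "label": str(label),
             "page": page, "text": item.text}
        )

    cell_rows = []
    for t, table in enumerate(doc.tables):
        for cell in table.data.table_cells:
            cell_rows.append(
                {"doc_id": doc_id, "table_id": t,
                 "row": cell.start_row_offset_idx,
                 "col": cell.start_col_offset_idx,
                 "text": cell.text}
            )
    return {"texts": text_rows, "table_cells": cell_rows}


def parse_pdf_docling_sketch(path):
    """Hypothetical wrapper: convert locally, then return plain rows."""
    from docling.document_converter import DocumentConverter  # lazy import
    doc = DocumentConverter().convert(path).document
    return document_to_tables(doc, doc_id=str(path))


# Stand-in document so the flattening step runs without Docling installed.
fake_doc = SimpleNamespace(
    texts=[SimpleNamespace(label="section_header", text="1 Introduction",
                           prov=[SimpleNamespace(page_no=1)])],
    tables=[SimpleNamespace(data=SimpleNamespace(table_cells=[
        SimpleNamespace(start_row_offset_idx=0, start_col_offset_idx=0,
                        text="name")]))],
)
demo_tables = document_to_tables(fake_doc, doc_id="demo")
```

Because the result is only lists of plain dicts, a downstream step can load it into a DataFrame or a database without importing Docling at all, which is the property the article relies on.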