Vision LLMs are PDF Parsers Too: Reading Charts and Diagrams for RAG
Vision LLMs are PDF Parsers Too: Reading Charts and Diagrams for RAG
视觉大模型也是 PDF 解析器:为 RAG 读取图表与示意图
LLM Applications Vision LLMs are PDF Parsers Too: Reading Charts and Diagrams for RAG Enterprise Document Intelligence [Vol.1 #5quater] – The other parsers read the words on a page. A vision model also reads the pictures. 大模型应用:视觉大模型也是 PDF 解析器——为 RAG 读取图表与示意图。企业文档智能 [第 1 卷 #5quater] —— 其他解析器读取页面上的文字,而视觉模型还能读取图片。
Kezhan Shi Jun 14, 2026 15 min read. This article is a parsing companion in Enterprise Document Intelligence, the series that builds an enterprise RAG system from four bricks. Article 5 (document parsing) built the parser with PyMuPDF (fitz), which reads the words on a page. This companion swaps the engine for a vision LLM that reads the page as an image, so it gives you the words plus the one thing the text parsers cannot, the content of the pictures. Kezhan Shi,2026 年 6 月 14 日,阅读时长 15 分钟。本文是“企业文档智能”系列的解析篇补充,该系列旨在通过四个模块构建企业级 RAG 系统。第 5 篇文章(文档解析)使用 PyMuPDF (fitz) 构建了解析器,用于读取页面文字。本文作为补充,将引擎替换为视觉大模型,将页面作为图像进行读取,从而在获取文字的同时,还能获取文本解析器无法处理的内容——即图片信息。
Show a PDF parser a chart and it sees an empty box. The text engines, native or cloud or local, all find the words on a page and put them in searchable tables. A chart has no words, so to every one of them the region is blank, and to a retrieval system it does not exist. 当向 PDF 解析器展示图表时,它只能看到一个空白框。无论是原生、云端还是本地的文本引擎,它们都只能找到页面上的文字并将其放入可搜索的表格中。由于图表没有文字,对它们而言该区域是空白的,对于检索系统来说,它根本不存在。
A vision model is different. It looks at the page the way a person would. Ask it for the text and it gives you the text and the tables, just like the others. Show it a chart and it tells you what the chart says, in plain words you can search. That last part is what the others can’t do. 视觉模型则不同。它像人类一样观察页面。要求它提取文本时,它能像其他引擎一样提供文字和表格;而展示图表时,它能用你可以搜索的通俗语言告诉你图表表达的内容。最后这一点是其他引擎无法做到的。
The catch: it is slower, costs more, and reads numbers off a chart only roughly. It is also only as good as the model you pick. gpt-4.1 reads a chart that the cheaper gpt-4o-mini half-misses. So you don’t use it everywhere. You save it for the pages that are mostly pictures, where the other parsers come back empty. 代价是:它速度更慢、成本更高,且读取图表中的数字仅为近似值。此外,其效果完全取决于你选择的模型。gpt-4.1 能读取图表,而更便宜的 gpt-4o-mini 却会漏掉一半。因此,你不能在所有地方都使用它,应将其留给那些以图片为主、其他解析器无法处理的页面。
1. The one thing only a vision model can do: make an image searchable
1. 视觉模型独有的功能:让图像可搜索
Start with the reason this parser exists at all. The textual engines turn a page into the relational tables from the earlier articles, but a figure defeats them: they return a chart as a bounding box in image_df with maybe a stray axis label. There is no text in a chart, so to OCR and to a layout model the region is empty, and to a retrieval system it does not exist. 首先说明该解析器存在的原因。文本引擎将页面转换为之前文章中提到的关系表,但图表却难倒了它们:它们通常只将图表作为 image_df 中的一个边界框返回,可能带有一个零散的坐标轴标签。图表中没有文本,因此对于 OCR 和布局模型来说该区域是空的,对于检索系统来说它不存在。
A vision model reads the picture. Below are three figures pulled straight out of two PDFs: the Transformer diagrams from Attention Is All You Need (Vaswani et al. 2017) and the commodity-price charts from the World Bank Commodity Markets Outlook (April 2026 issue). Each figure sits next to the one-sentence description gpt-4.1 wrote for it. 视觉模型则能读取图片。以下是从两份 PDF 中直接提取的三张图:来自《Attention Is All You Need》(Vaswani 等人,2017) 的 Transformer 示意图,以及来自世界银行《大宗商品市场展望》(2026 年 4 月刊) 的大宗商品价格图表。每张图旁边都有 gpt-4.1 为其撰写的一句描述。
The price chart is now a sentence: commodity price indices by sector, falling since their 2022 peak. A user searching for “commodity price index since 2022” can now hit that page. Before, there was nothing on it to match. 价格图表现在变成了一句话:按行业划分的大宗商品价格指数,自 2022 年峰值以来持续下跌。用户搜索“2022 年以来的大宗商品价格指数”现在可以定位到该页面。而在以前,页面上没有任何内容可以匹配。
Here is the argument in its sharpest form. Picture a satellite image of a parking lot. It has no text at all. OCR finds nothing, layout finds one box, and to a retrieval system the image does not exist. A vision model writes “aerial view of a parking lot, roughly half full, around forty cars”. Now a search for parking occupancy finds it. That sentence is the parse, and only a vision model can produce it. 这是最直观的论点:想象一张停车场的卫星图像。它没有任何文字。OCR 找不到任何东西,布局分析只能找到一个框,对检索系统来说该图像不存在。而视觉模型会写道:“停车场的鸟瞰图,大约半满,约有四十辆车”。现在,搜索“停车场占用率”就能找到它。那句话就是解析结果,只有视觉模型能生成它。
2. It also parses text and tables, like the others
2. 它也能像其他引擎一样解析文本和表格
The figure is the unique part, but a parser that only read pictures would be useless. A vision model reads the text and the tables too, and not worse than the textual engines on clean material. We pointed parse_page_vision at page 30 of the NIST Cybersecurity Framework, the Framework Core table, and asked for markdown. It returned the table columns intact, merged cells handled (the Function name sits on the first row of its block and the continuation rows leave it blank). 图表是其独特之处,但如果解析器只能读取图片,那将毫无用处。视觉模型也能读取文本和表格,在处理清晰的资料时,其表现并不逊色于文本引擎。我们将 parse_page_vision 指向 NIST 网络安全框架第 30 页的“框架核心”表格,并要求输出 Markdown 格式。它完整地返回了表格列,并处理了合并单元格(功能名称位于其块的第一行,后续行留空)。
This is the same cell structure Docling and Azure produce from the same page in the two previous articles: they emit markdown tables too, so the format is not what sets vision apart. The vision model never built a table object; it read the grid off the picture and wrote markdown (it returns HTML just as well). So the claim from the lead holds: it is a parser, returning the reusable model the others return, plus the figures they cannot. 这与前两篇文章中 Docling 和 Azure 从同一页面生成的单元格结构相同:它们也输出 Markdown 表格,因此格式并不是视觉模型的优势所在。视觉模型并没有构建表格对象;它是从图片中读取网格并编写 Markdown(它同样可以返回 HTML)。因此,开篇的结论成立:它是一个解析器,既能返回其他引擎所能提供的可重用模型,又能处理它们无法处理的图表。
3. The model matters: gpt-4o-mini misses charts that gpt-4.1 reads
3. 模型选择至关重要:gpt-4o-mini 会漏掉 gpt-4.1 能读取的图表
How good the parse is depends heavily on the model, and the gap shows precisely where it counts, on the figures. We ran the same CMO chart page through gpt-4o-mini and gpt-4.1. Both read the page text and the table; on the charts the cheaper model finds half. 解析效果在很大程度上取决于模型,而差距恰恰体现在关键的图表上。我们使用 gpt-4o-mini 和 gpt-4.1 运行了同一份 CMO 图表页面。两者都能读取页面文本和表格;但在图表方面,较便宜的模型只能识别出一半。
gpt-4o-mini found three of the six charts and labelled two of them as tables. gpt-4.1 found all six and transcribed their axes down to the month, including the policy-uncertainty and temperature-anomaly charts the smaller model missed. Both read the page text and the NIST table correctly. The weaker model fell down on the pictures, the one thing you brought vision in to do. gpt-4o-mini 在六张图表中只找到了三张,并将其中两张标记为表格。gpt-4.1 找全了六张,并精确转录了坐标轴到月份,包括较小模型漏掉的政策不确定性和温度异常图表。两者都能正确读取页面文本和 NIST 表格。较弱的模型在图片处理上表现不佳,而这正是你引入视觉模型的原因。
4. The honest trade: exactness and cost
4. 诚实的权衡:精确度与成本
None of this is free, and the catch is worth naming plainly. It is not that vision “isn’t really parsing”, because it is. It is that the parse is less exact and costs more per page. Two costs stand out. Exactness, with two faces: The values it reads off a curve are approximate: the shape and the gist are right, a specific tick can be off, so treat a transcribed number as a lead, not a fact. 这一切并非免费,其代价值得明确指出。并不是说视觉模型“不是真正的解析”,它确实是。问题在于解析的精确度较低,且每页成本更高。有两个成本尤为突出。精确度有两个方面:它从曲线中读取的数值是近似的:形状和要点是正确的,但具体的刻度可能会有偏差,因此请将转录的数字视为线索,而非事实。