Making a PDF’s Images Searchable for RAG, Without Paying to Read Them All

Enterprise Document Intelligence [Vol.1 #5sexies] – image_df tells you where every picture is. Turning the few that matter into searchable text is a separate, cost-ordered job.

Kezhan Shi | Jun 20, 2026 | 17 min read

This article is a document-parsing companion in Enterprise Document Intelligence, the series that builds an enterprise RAG system from four bricks. It extends Article 5 (document parsing) on one table: image_df, which locates every picture in the PDF without reading any of them. This part builds the reading toolbox: a cost-ordered cascade (a cheap filter, a type check, classic OCR, a vision model) that turns the few images worth paying for into searchable text.

The parsing brick gives you image_df: one row per image in the PDF, with its page, its bounding box, its size, a content hash. That locates every picture. It does not say what any of them shows. For retrieval, that is the same as not having them: a bounding box is not something a user can search, and the image’s text slot, the place a description would live, is empty.

The reflex is to throw a vision model at every image and be done. That is the wrong default. A real document is full of images that carry nothing a reader would ever search for: the company logo in every page header, a horizontal rule drawn as a 2-pixel-tall picture, a bullet glyph, a decorative banner. Captioning those with a vision LLM is paying a model to describe a logo three hundred times.

So the job splits in two. First, the methods that turn an image into text, and what each one costs: a cheap filter, a type check, classic OCR, a vision model. Second, which images are actually worth spending on in a given run. That second half is driven by context. A body line that reads “Figure 3 below shows…” is the cue to read that figure with a vision model, and not its neighbours; the question being asked narrows it further.
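To make the "Figure 3 below shows…" cue concrete, here is a minimal sketch of how such references might be pulled out of body text. The function name, the regular expression, and the phrasings it covers are assumptions for illustration, not part of docintel; mapping a figure number back to a row of image_df is a separate step that this sketch does not attempt.

```python
import re

# Illustrative pattern: covers "Figure 3", "Fig. 12", "figure 4".
# Real documents may use other phrasings or languages.
FIGURE_REF = re.compile(r"\b(?:Figure|Fig\.?)\s*(\d+)", re.IGNORECASE)


def referenced_figures(body_lines: list[str]) -> set[int]:
    """Collect the figure numbers that the body text explicitly points at."""
    found: set[int] = set()
    for line in body_lines:
        for match in FIGURE_REF.finditer(line):
            found.add(int(match.group(1)))
    return found


lines = [
    "Figure 3 below shows the quarterly trend.",
    "See Fig. 12 and figure 4 for the breakdown.",
    "This paragraph mentions no figure at all.",
]
cited = referenced_figures(lines)
```

Only the figures in `cited` would be candidates for a vision call in this sketch; everything else stays unread until something asks for it.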

1. Most images are not worth a model call

The first step spends nothing. Before any OCR or vision call, a cheap filter looks at signals already in image_df plus a couple of pixel statistics, and drops the images with no retrieval value:

  • Too small. An image whose shortest side is a few dozen pixels, or whose total area is below a small floor, is an icon or a bullet, not a figure. A size threshold removes most of them.
  • The wrong shape. A picture that is very long and very thin is a rule or a divider, not content. An aspect-ratio guard catches those.
  • Repeated everywhere. The same content hash on most pages of the document is chrome: a header logo, a footer mark, a watermark. Counting how many pages an image hash appears on flags it as decoration, not information.

is_worth_analyzing applies these size and shape rules per image, and flag_worth_analyzing first derives the per-page repeat frequency from the content hash, then adds a worth_analyzing column to image_df. Both live in docintel.parsing.pdf.images.
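The three rules can be sketched in a few lines. What follows is not the docintel implementation: ImageRow, passes_size_and_shape, flag_rows, and every threshold are hypothetical stand-ins chosen for illustration, and plain dataclasses replace the DataFrame so the example has no dependencies.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRow:
    """Hypothetical stand-in for one row of image_df."""
    page: int
    width: int
    height: int
    content_hash: str


# Illustrative thresholds, not values taken from docintel.
MIN_SIDE_PX = 48          # shortest side below this: icon or bullet
MIN_AREA_PX = 64 * 64     # total area floor
MAX_ASPECT_RATIO = 12.0   # longer side / shorter side above this: rule or divider
MAX_PAGE_SHARE = 0.5      # hash seen on more than half the pages: chrome


def passes_size_and_shape(row: ImageRow) -> bool:
    """Reject images that are too small or too long and thin."""
    short, long_ = sorted((row.width, row.height))
    if short < MIN_SIDE_PX or row.width * row.height < MIN_AREA_PX:
        return False
    return long_ / short <= MAX_ASPECT_RATIO


def flag_rows(rows: list[ImageRow], n_pages: int) -> list[bool]:
    """Return one worth-analyzing flag per row, in input order."""
    pages_by_hash: dict[str, set[int]] = defaultdict(set)
    for row in rows:
        pages_by_hash[row.content_hash].add(row.page)
    flags = []
    for row in rows:
        page_share = len(pages_by_hash[row.content_hash]) / n_pages
        flags.append(passes_size_and_shape(row) and page_share <= MAX_PAGE_SHARE)
    return flags


rows = [ImageRow(p, 200, 60, "logo") for p in range(1, 11)] + [
    ImageRow(3, 800, 2, "rule"),
    ImageRow(4, 16, 16, "bullet"),
    ImageRow(5, 900, 600, "fig3"),
]
flags = flag_rows(rows, n_pages=10)  # only the last row is kept
```

The repeat rule needs the whole document before it can judge a single image, which is why it sits in the table-level function while size and shape can be checked one row at a time.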

2. What kind of image is it?

The images that survive the filter are not all read the same way. A screenshot of a table is text: classic OCR reads it cheaply and exactly. A line chart is not text at all; its meaning is in the axes and the trend, and only a vision model can put that into words.

So the second step classifies each kept image into one type:

  • decorative: a blank or near-uniform panel. Skip.
  • text: a screenshot, a scanned region, a table rendered as an image. Read with OCR.
  • chart / diagram / photo: the meaning is visual. Read with a vision model.
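The list above amounts to a small routing table from type to reader. A minimal sketch follows; the ImageType defined here is a local stand-in whose member names are assumed from the list, and Reader and choose_reader are hypothetical names, not docintel API.

```python
from enum import Enum


class ImageType(Enum):
    """Local stand-in; member names are assumed from the list above."""
    DECORATIVE = "decorative"
    TEXT = "text"
    CHART = "chart"
    DIAGRAM = "diagram"
    PHOTO = "photo"


class Reader(Enum):
    """Hypothetical labels for the three possible outcomes."""
    SKIP = "skip"
    OCR = "ocr"
    VISION = "vision"


# Cheapest adequate reader per type.
ROUTES = {
    ImageType.DECORATIVE: Reader.SKIP,
    ImageType.TEXT: Reader.OCR,
    ImageType.CHART: Reader.VISION,
    ImageType.DIAGRAM: Reader.VISION,
    ImageType.PHOTO: Reader.VISION,
}


def choose_reader(image_type: ImageType) -> Reader:
    """Look up which reader an image of this type should go to."""
    return ROUTES[image_type]
```

Keeping the routing as data rather than branching logic makes it easy to see at a glance which types ever reach the expensive vision call.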

classify_image returns one ImageType from cheap pixel signals: how much the pixels vary, how saturated they are, how much of the image is near-white background, how dense its edges are. A near-uniform panel is decorative.
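As an illustration of what such signals might look like, here is a dependency-free sketch over a grid of RGB tuples. It is not classify_image: the function names, the thresholds, and the rule that separates text from visual content are all assumptions, and it returns only a coarse three-way label because the source states just one rule explicitly (near-uniform means decorative).

```python
import colorsys
from statistics import fmean, pstdev

Pixel = tuple[int, int, int]


def pixel_signals(grid: list[list[Pixel]]) -> dict[str, float]:
    """Compute four cheap statistics from a row-major grid of RGB pixels."""
    luma = [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in grid]
    flat = [value for row in luma for value in row]
    saturation = fmean(
        colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1]
        for row in grid
        for r, g, b in row
    )
    # Edge proxy: share of horizontal neighbours whose brightness jumps sharply.
    pairs = [(a, b) for row in luma for a, b in zip(row, row[1:])]
    edge_density = sum(abs(a - b) > 32 for a, b in pairs) / len(pairs) if pairs else 0.0
    return {
        "luma_std": pstdev(flat),
        "saturation": saturation,
        "near_white": sum(value > 240 for value in flat) / len(flat),
        "edge_density": edge_density,
    }


def coarse_type(grid: list[list[Pixel]]) -> str:
    """Return 'decorative', 'text', or 'visual' using assumed thresholds."""
    s = pixel_signals(grid)
    if s["luma_std"] < 5.0:
        return "decorative"  # near-uniform panel
    # Assumed heuristic: text is mostly white background, unsaturated, edge-rich.
    if s["near_white"] > 0.5 and s["saturation"] < 0.1 and s["edge_density"] > 0.05:
        return "text"
    return "visual"


WHITE, BLACK, RED, BLUE = (255, 255, 255), (0, 0, 0), (255, 0, 0), (0, 0, 255)
blank_panel = [[WHITE] * 8 for _ in range(8)]
text_like = [[WHITE, WHITE, WHITE, BLACK] * 2 for _ in range(8)]
colourful = [[RED, BLUE] * 4 for _ in range(8)]
```

On real images these thresholds would need tuning against labelled samples, and a misclassified image only costs one wrong reader choice, which is why cheap signals are acceptable at this stage.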