Proxy-Pointer RAG: Eliminating Wasteful Entity & Relations Extraction in Knowledge Graphs

Structure-guided NER optimization for enterprise GraphRAG systems.

In my previous article, Solving Entity and Relationship Sprawl in Knowledge Graphs, I discussed how the Proxy-Pointer architecture can optimize the search for the right entities and relations. That, however, is only the second part of a larger problem in graph ingestion. The bigger and far more expensive step is identifying those entities (NER) and relations in the first place.

Knowledge graphs are built to answer complex aggregation and multi-hop queries across entities and relationships over similar documents: vendor contracts, compliance manuals, credit agreements, global terms and conditions, etc. These documents are routinely over 100 pages long, with dense text exceeding 500k characters. Enterprises frequently ingest thousands of similar contracts from the same suppliers and customers. To do that, each document is passed through a powerful LLM for NER and relation extraction, burning millions of tokens before the actual graph ingestion can even begin.

The process sometimes has to be repeated, since long-context extraction often suffers from reduced recall consistency and increased extraction variance. The crucial fact, however, is that legal documents such as contracts have a very similar structure across organizations, and even across industries. They are packed with dense boilerplate text, schedules, exhibits, etc., most of which is of little value for NER yet still has to be read by an LLM.

But what if we could exploit this structural predictability? What if we could predict the value of a section before we ever send it to the LLM, drastically cutting ingestion costs by strategically ignoring the noise? In this article, we will explore a novel approach to minimizing the content seen by the LLM. By leveraging the structural concepts of Proxy-Pointer RAG and introducing a predictive metric called Graphability Indexing, we can selectively bypass low-yield sections of dense documents.

Quick Recap: What is Proxy-Pointer?

Proxy-Pointer is a structure-aware RAG technique that delivers surgical precision over complex documents such as annual reports and credit agreements, for the same cost as standard vector RAG. Standard vector RAG splits documents into blind chunks, embeds them, and retrieves the top-K by cosine similarity. Even with overlap and semantic chunking, this is not a reliable method for relationship extraction in enterprise KGs, because chunks fragment the context of a document and make extraction prone to hallucination.

Instead, Proxy-Pointer treats a document as a tree of self-contained semantic blocks (sections). Context is encapsulated within each section, which makes sections good candidates for relation extraction. An LLM is also much more likely to identify the entities and relations accurately from a single section in one pass than from a full 100-page document, making repeated scans unnecessary.
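To make the "tree of sections" idea concrete, here is a minimal sketch that parses a plain-text document into nested sections. It is not the Proxy-Pointer implementation. It assumes headings are numbered like "1. Definitions" or "2.1 Payment" (real contracts vary widely), and the names `Section` and `build_section_tree` are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Assumption: a heading is a line that starts with a dotted number such as
# "1", "2.1" or "2.1.3", followed by a title. Body lines that happen to
# start with a number (e.g. "30 days ...") would be misread as headings.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


@dataclass
class Section:
    number: str
    title: str
    body: list[str] = field(default_factory=list)
    children: list["Section"] = field(default_factory=list)

    @property
    def depth(self) -> int:
        # "" (root) -> 0, "2" -> 1, "2.1" -> 2, ...
        return self.number.count(".") + 1 if self.number else 0


def build_section_tree(text: str) -> Section:
    """Group lines under the nearest preceding heading, nested by depth."""
    root = Section(number="", title="ROOT")
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADING.match(line)
        if match:
            node = Section(number=match.group(1), title=match.group(2))
            # Pop until the top of the stack is a shallower section.
            while stack[-1].depth >= node.depth:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        elif line:
            stack[-1].body.append(line)
    return root


sample = """1. Definitions
"Supplier" means Acme Corp.
2. Commercial Terms
2.1 Payment
Invoices are due within thirty days.
2.2 Termination
Either party may terminate with notice.
"""
tree = build_section_tree(sample)
for top in tree.children:
    print(top.number, top.title, [c.number for c in top.children])
```

Each node carries its own body text, so a single section can be handed to an extraction model without the rest of the document.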

Existing Methods for NER Optimization

Traditional NLP / Pre-Trained Models (e.g., spaCy): A common first approach is a funnel that pairs a lightweight, traditional NLP pipeline such as spaCy with an LLM. These models are extremely fast and cheap, and are pre-trained to recognize standard entities (persons, organizations, locations, dates). They are used to scan a document for entity hotspot regions, and the LLM then scans only those hotspots in a focused manner.
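The funnel can be sketched as follows. To keep the example dependency-free, a crude regular expression stands in for the lightweight NER pass; in practice that role would be played by a pipeline such as spaCy. The function names (`cheap_entity_count`, `find_hotspots`, `funnel_extract`), the `min_entities` threshold, and the `llm_extract` callback are all hypothetical and only illustrate the control flow.

```python
import re
from typing import Callable

# Stand-in for a cheap NER model (assumption: runs of two or more capitalized
# words, or a 4-digit year, roughly indicate an entity mention).
ENTITY_LIKE = re.compile(r"\b(?:[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+|\d{4})\b")


def cheap_entity_count(text: str) -> int:
    """Count entity-like spans using the regex stand-in."""
    return len(ENTITY_LIKE.findall(text))


def find_hotspots(
    chunks: list[str],
    min_entities: int = 2,
    count_entities: Callable[[str], int] = cheap_entity_count,
) -> list[int]:
    """Return indices of chunks dense enough to justify an LLM call."""
    return [i for i, c in enumerate(chunks) if count_entities(c) >= min_entities]


def funnel_extract(
    chunks: list[str],
    llm_extract: Callable[[str], object],
    min_entities: int = 2,
) -> dict[int, object]:
    """Run the expensive extractor only on hotspot chunks."""
    return {i: llm_extract(chunks[i]) for i in find_hotspots(chunks, min_entities)}


chunks = [
    "Acme Corp shall pay Globex Industries by 2025.",
    "this clause is intentionally left blank.",
    "The parties agree to the terms herein.",
]
# The lambda is a placeholder for a real LLM call.
results = funnel_extract(chunks, llm_extract=lambda text: f"<extracted from: {text}>")
print(sorted(results))
```

Note that the cheap pass still has to read every chunk; the saving comes only from the LLM calls that are skipped.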

LLM Pre-Scanning (Smaller Router Models): Another approach is to use a smaller, cheaper LLM to quickly pre-scan chunks and decide whether they contain valuable relationships, then send only the high-value chunks to a large reasoning model for deep extraction. Although this is cheaper per token, we are still forcing a model to read every word of a 500k-character document. It is therefore also a wasteful double scan of large parts of the document.
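The double scan is easy to quantify with a back-of-the-envelope calculation. The sketch below counts characters read rather than money, so no pricing is assumed. The `high_value_fraction` of 0.4 is a purely illustrative value, not a measurement, and the function name is hypothetical.

```python
def router_pipeline_chars(doc_chars: int, high_value_fraction: float) -> dict[str, int]:
    """Characters read by each stage of a router + large-model pipeline."""
    router_chars = doc_chars  # the router model reads the whole document
    large_chars = round(doc_chars * high_value_fraction)  # forwarded chunks
    return {
        "router_chars": router_chars,
        "large_model_chars": large_chars,
        "total_chars_read": router_chars + large_chars,
        # Forwarded chunks are read twice: once by each model.
        "double_scanned_chars": large_chars,
    }


# Illustrative only: a 500k-character document where 40% is forwarded.
print(router_pipeline_chars(500_000, 0.4))
```

Under these assumptions the models read 700,000 characters in total for a 500,000-character document, and every forwarded character is read twice.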

Proxy-Pointer Approach

As mentioned earlier, Proxy-Pointer leverages the following properties of knowledge graphs: Graphs are built for a domain or functional area, and therefore store similar document content. A procurement graph will ingest multiple supplier contracts (often many contracts from the same supplier), while a finance graph will hold many lender and credit documents, compliance documents, etc. The documents share a similar baseline structure: sections, schedules, exhibits, etc. And only a fraction of the content is enough for…