Proxy-Pointer RAG: Solving Entity and Relationship Sprawl in Large Knowledge Graphs

A scalable semantic localization layer for entity and relationship reconciliation

Enterprise knowledge graphs have become one of the most widely used business semantic layers, providing a unified view of an organization’s suppliers, contracts, products, partners, and more. As a result, they evolve organically over time and become very large, with millions of nodes (entities) and many times more edges (relations).

Even with governance controls and ontologies in place, the pipelines feeding data into the graph often adhere to them inconsistently. New business rules emerge, naming conventions change, and older regions of the graph are frequently left untouched because of the sheer complexity and computational cost of upgrading them. All of this makes a large graph increasingly difficult to maintain.

One of the biggest operational problems occurs at the ingestion layer. For every new document to be added, the same questions come up again and again: Does Sony Corp already exist in the graph? If so, under what name? Is the “Sony Corp” listed in this new document the same entity as the “Sony Interactive Entertainment” already present in the graph? Or do the two hold different relationships to our organization, thereby requiring a distinct new node? Which relationships already exist?

Semantic ambiguities (“supplies”, “provides”, “is contracted for”) make reconciliation increasingly difficult at scale. Without an effective tool to narrow the search space, ingestion pipelines are forced to run expensive global graph searches to scan for variations, which degrades performance and incurs large computational costs.

What if there were a scalable, low-cost, and fast way to scan the thousands of historical documents already ingested into the graph and determine the likely entities and relations before querying the knowledge graph? Even better, what if the context gathered this way could be used for semantic localization: telling the pipeline exactly which region of the graph to update, rather than forcing it to traverse the whole thing?

The obvious choice for this pre-filtering step is a vector index. However, traditional Retrieval-Augmented Generation (RAG) is entirely unsuitable for this task. Standard vector chunking fragments a document into isolated snippets with no shared structural narrative. While a chunk may surface an entity name, it strips away the surrounding context needed to accurately extract the relationships between companies, products, people, places, and so on.
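To make the fragmentation concrete, here is a minimal sketch of naive fixed-size chunking. The `chunk_words` helper and the five-word window are illustrative assumptions, not part of any particular RAG library; real pipelines usually chunk by tokens with some overlap, but the failure mode is similar.

```python
def chunk_words(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of `size` words, ignoring structure."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


sentence = "AMD partnered with Sony for PlayStation semi-custom SoCs."
chunks = chunk_words(sentence, size=5)

for chunk in chunks:
    print(repr(chunk))

# The supplier (AMD) and the thing supplied (SoCs) land in different chunks,
# so neither chunk on its own states the full relationship.
split_apart = not any("AMD" in c and "SoCs" in c for c in chunks)
print("relationship split across chunks:", split_apart)
```

Either chunk can still match a query on an entity name, but an extractor that sees only one of them cannot recover who supplies what to whom.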

That is where the Proxy-Pointer architecture comes in. In this article, I will demonstrate a novel approach to quickly and reliably extract entities and relationships from historical documents. By using vector matches as “pointers” to retrieve intact structural sections of a document, we can shift the burden of entity reconciliation away from the expensive knowledge graph and onto a significantly faster, cheaper, and more accurate vector retrieval pipeline.

Quick Recap: What is Proxy-Pointer?

Standard vector RAG splits documents into blind chunks, embeds them, and retrieves the top-K by cosine similarity. The synthesizer LLM sees fragmented, context-less text, and frequently hallucinates or misses the answer entirely. Proxy-Pointer fixes this with five zero-cost engineering techniques:

Skeleton Tree: Parse Markdown headings into a hierarchical tree (pure Python, no LLM needed)
Breadcrumb Injection: Prepend the full structural path (AMD > Financial Statements > Cash Flows) to every chunk before embedding
Structure-Guided Chunking: Split text within section boundaries, never across them
Noise Filtering: Remove distracting sections (TOC, glossary, executive summaries) from the index
Pointer-Based Context: Use retrieved chunks as pointers to load the full, unbroken document section for the synthesizer

The result: every chunk knows where it lives in the document, and the synthesizer sees complete sections, not fragments.
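The five techniques can be sketched in a few dozen lines of plain Python. This is a simplified illustration under stated assumptions, not the actual Proxy-Pointer implementation: the sample document is invented, the names `Section`, `Chunk`, `build_skeleton`, `make_chunks`, and `retrieve_section` are hypothetical, and a word-overlap score stands in for cosine similarity over real embeddings.

```python
import re
from dataclasses import dataclass

# Invented sample document; the text is illustrative, not real filing content.
SAMPLE = """\
# AMD
## Table of Contents
Partners, Cash Flows
## Partners
AMD partnered with Sony for PlayStation semi-custom SoCs.
The agreement covers design and supply of console processors.
## Cash Flows
Operating cash flow is reported in this section.
"""

NOISE = {"table of contents", "glossary", "executive summary"}


@dataclass
class Section:
    path: list[str]   # breadcrumb from the root heading to this heading
    body: str = ""    # text directly under this heading


@dataclass
class Chunk:
    text: str         # breadcrumb + fragment: what would be embedded
    section_id: int   # pointer back to the full section


def build_skeleton(markdown: str) -> list[Section]:
    """Skeleton tree: parse headings, tracking each section's ancestry."""
    sections, stack = [], []
    for line in markdown.splitlines():
        heading = re.match(r"^(#+)\s+(.*)$", line)
        if heading:
            level = len(heading.group(1))
            stack = stack[:level - 1] + [heading.group(2).strip()]
            sections.append(Section(path=list(stack)))
        elif sections and line.strip():
            sections[-1].body += line.strip() + " "
    return sections


def make_chunks(sections: list[Section], size: int = 8) -> list[Chunk]:
    """Chunk inside each section only, skip noise, prepend the breadcrumb."""
    chunks = []
    for idx, section in enumerate(sections):
        if section.path[-1].lower() in NOISE:
            continue                          # noise filtering
        words = section.body.split()
        crumb = " > ".join(section.path)      # breadcrumb injection
        for i in range(0, len(words), size):  # never crosses a section
            chunks.append(Chunk(f"{crumb}: {' '.join(words[i:i + size])}", idx))
    return chunks


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_section(query: str, chunks: list[Chunk],
                     sections: list[Section]) -> Section:
    """Pointer-based context: the best chunk only selects a section."""
    best = max(chunks, key=lambda c: len(tokens(query) & tokens(c.text)))
    return sections[best.section_id]


sections = build_skeleton(SAMPLE)
chunks = make_chunks(sections)
section = retrieve_section("PlayStation processors supplier", chunks, sections)
print(" > ".join(section.path))
print(section.body.strip())
```

The matching chunk holds only part of the section, but the synthesizer receives the whole "AMD > Partners" section because the chunk is used purely as a pointer.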

How Knowledge Graphs Handle Reconciliation

While it is clear why traditional vector databases are not suitable for reconciliation, it is worth examining how knowledge graphs tackle this problem. Most enterprise graph databases can perform semantic similarity matching over nodes and relationships. In addition, they offer a variety of tools: ontology matching, alias tables, fuzzy matching, and graph neural networks (GNNs).
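As a rough illustration of the two simplest tools in that list, the sketch below combines an alias table with fuzzy string matching from Python's standard `difflib` module. The alias entries, node names, and the `candidate_nodes` helper are hypothetical; a production graph database would expose its own matching facilities.

```python
import difflib

# Hypothetical alias table: lowercase surface form -> canonical node name.
ALIASES = {
    "sony": "Sony Corp",
    "sie": "Sony Interactive Entertainment",
}
NODE_NAMES = ["Sony Corp", "Sony Interactive Entertainment", "AMD"]


def candidate_nodes(mention: str, cutoff: float = 0.6) -> list[str]:
    """Return canonical node names that might match a mention."""
    key = mention.strip().lower()
    if key in ALIASES:                        # exact alias hit
        return [ALIASES[key]]
    lowered = {name.lower(): name for name in NODE_NAMES}
    # Fall back to character-level similarity against known node names.
    close = difflib.get_close_matches(key, list(lowered), n=3, cutoff=cutoff)
    return [lowered[name] for name in close]


print(candidate_nodes("SIE"))
print(candidate_nodes("Sony Corporation"))
print(candidate_nodes("Globex"))
```

Lexical tools like these catch spelling variants, but they cannot say whether two differently named nodes refer to the same business entity, which is where embedding similarity comes in.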

But perhaps the best-known and most widely used technique is embedding similarity. In a modern graph, the nodes and edges carry vector embeddings. A node embedding captures not only the node name (e.g., Sony Corp) but also its metadata (tags such as industry) and its local topology (neighboring nodes and relations).

In principle, this allows the system to identify nodes that are semantically close even when their names differ. For example, a graph search for “Sony + gaming ecosystem + supplier” may retrieve nodes such as PlayStation ecosystem, Sony Corp, or Sony Interactive Entertainment.
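A toy version of this search is sketched below. It is only an approximation under clear assumptions: the nodes, tags, and edges are invented, and a bag-of-words count vector stands in for a learned embedding model, so the scores illustrate the mechanics rather than real semantic similarity.

```python
import math
import re
from collections import Counter

# Hypothetical nodes: name, metadata tags, and neighborhood edges.
NODES = {
    "Sony Corp": {"tags": ["electronics", "entertainment"],
                  "edges": ["owns Sony Interactive Entertainment"]},
    "Sony Interactive Entertainment": {"tags": ["gaming"],
                                       "edges": ["operates PlayStation ecosystem"]},
    "PlayStation ecosystem": {"tags": ["gaming", "platform"],
                              "edges": ["has supplier AMD"]},
    "Acme Logistics": {"tags": ["shipping"],
                       "edges": ["contracts Acme Freight"]},
}


def node_text(name: str) -> str:
    """Combine name, metadata, and local topology into one description."""
    node = NODES[name]
    return " ".join([name, *node["tags"], *node["edges"]])


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank(query: str) -> list[tuple[str, float]]:
    q = embed(query)
    scores = [(name, cosine(q, embed(node_text(name)))) for name in NODES]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranked = rank("Sony gaming ecosystem supplier")
for name, score in ranked:
    print(f"{score:.2f}  {name}")
```

All three Sony-related nodes come back with a non-zero score while the unrelated node does not, but the ranking alone does not tell the pipeline which of them should receive a new relationship.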

However, this approach becomes harder to apply at enterprise scale. As semantically similar entities proliferate, whether by design or because of messy historical data, it becomes increasingly difficult to predict which specific entity node is the correct target for the new relationship we are trying to ingest.

Consider a single sentence: “AMD partnered with Sony for PlayStation semi-custom SoCs.” It contains not only entity identities (AMD, Sony, PlayStation) but also relationship semantics (partnered with), platform context (PlayStation), and business role (semi-custom SoCs). Implicitly, the sentence maps to multiple distinct relationships: AMD is the chip designer/supplier, Sony is the platform owner/customer, and the interaction is hardware-oriented.
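One way to picture that decomposition is as a small set of typed relations. The sketch below is hand-written for illustration: the `Relation` class and the predicate names (`PARTNERED_WITH`, `SUPPLIES`, and so on) are assumptions, not an established ontology or the output of an extractor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str
    obj: str
    context: str = ""   # optional qualifier, e.g. the platform involved


sentence = "AMD partnered with Sony for PlayStation semi-custom SoCs"

# Hand-written decomposition; predicate names are illustrative assumptions.
relations = [
    Relation("AMD", "PARTNERED_WITH", "Sony", context="PlayStation"),
    Relation("AMD", "SUPPLIES", "semi-custom SoCs", context="PlayStation"),
    Relation("Sony", "OWNS_PLATFORM", "PlayStation"),
    Relation("Sony", "CUSTOMER_OF", "AMD", context="semi-custom SoCs"),
]

entities = sorted({r.subject for r in relations} | {r.obj for r in relations})
print(entities)
for r in relations:
    print(f"({r.subject}) -[{r.predicate}]-> ({r.obj})")
```

A single sentence therefore touches several candidate nodes and edges at once, and each one has to be reconciled against what the graph already holds.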