The Untaught Lessons of RAG Question Parsing: Structure Before You Search
Enterprise Document Intelligence [Vol.1 #6ter] – Six positions on the question-parsing brick that contradict the mainstream RAG playbook.
This article is a manifesto companion to Enterprise Document Intelligence, the series whose philosophy is laid out in Amplify the Expert. It zooms in on brick 2 (question parsing) of the four-brick architecture and surfaces the lessons most tutorials skip.
Most RAG tutorials skip question parsing. The user's string goes straight to retrieval, cosine similarity picks the top-k chunks, and the model is handed whatever comes back. We do not do that, for one reason: a user question is not a search query. Treat it as one and you get silent partial answers, and in production that is where a lot of RAG quietly breaks.
The naive baseline this article pushes back on
The naive pipeline embeds the user string and asks the vector store for the top-k most similar chunks. Nothing in that setup knows the question had two parts, or that the user wanted an exact value and not a paragraph. So we spend one extra brick on the question itself: a row in question_df with five typed columns (keywords, scope, shape, decomposition, clarification) plus satellite tables, and two derived briefs (RetrievalQuery for the retrieval brick, GenerationBrief for the generation brick).
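As a rough illustration of that row and its two briefs, here is a minimal sketch using plain dataclasses in place of a DataFrame. The five column names come from the text above; the field types, the shape labels, and the derive_briefs helper are assumptions made for the example, not the series' actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class QuestionRow:
    """One row of question_df: the five typed columns named in the text."""
    question_id: int
    raw_text: str
    keywords: List[str]
    scope: List[str]                  # assumed: section titles the question targets
    shape: str                        # assumed labels, e.g. "exact_value", "boolean"
    decomposition: List[str] = field(default_factory=list)  # sub-questions, if any
    clarification: Optional[str] = None  # follow-up to ask the user, if needed


@dataclass
class RetrievalQuery:
    """What the retrieval brick needs: where to look and what to look for."""
    keywords: List[str]
    scope: List[str]
    sub_questions: List[str]


@dataclass
class GenerationBrief:
    """What the generation brick needs: what to answer and in which form."""
    question: str
    shape: str
    sub_questions: List[str]


def derive_briefs(row: QuestionRow) -> Tuple[RetrievalQuery, GenerationBrief]:
    # A question with no decomposition is treated as a single sub-question.
    parts = row.decomposition or [row.raw_text]
    return (
        RetrievalQuery(row.keywords, row.scope, parts),
        GenerationBrief(row.raw_text, row.shape, parts),
    )
```

With this shape, a two-part question keeps both parts visible to retrieval and generation, and the expected answer form travels with it, which is exactly the information the naive string-only pipeline never has.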
The anatomy diagram shows the five core columns, but a production question_df carries two more that decide how wide a window retrieval will pass to generation. The context discipline is measured in lines (not characters, too noisy; not pages, too coarse). The table below shows three sample rows: one factual lookup, one yes/no boolean, one listing question. Each row sizes its context window differently, by reading the answer shape and the decomposition pattern.
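To make the line-based window concrete, here is a small sketch of how a row could size its context from the answer shape and the decomposition. The text does not name the two extra columns or give their values, so the (lines_before, lines_after) pairs, the shape labels, and the one-window-per-sub-question rule below are all placeholders chosen for illustration.

```python
from typing import Dict, Tuple

# Hypothetical (lines_before, lines_after) per answer shape. The numbers are
# placeholders for illustration, not values taken from the article.
WINDOW_BY_SHAPE: Dict[str, Tuple[int, int]] = {
    "factual": (2, 2),    # an exact value usually sits close to the hit
    "boolean": (5, 5),    # yes/no often depends on nearby conditions
    "listing": (3, 30),   # a list tends to run downward from its header
}
DEFAULT_WINDOW: Tuple[int, int] = (5, 5)


def window_for(shape: str) -> Tuple[int, int]:
    """Look up the window for a shape, falling back to a default."""
    return WINDOW_BY_SHAPE.get(shape, DEFAULT_WINDOW)


def line_span(hit_line: int, shape: str, last_line: int) -> Tuple[int, int]:
    """Inclusive (start, end) line range around a hit, clipped to the document."""
    before, after = window_for(shape)
    return max(0, hit_line - before), min(last_line, hit_line + after)


def context_budget(shape: str, n_sub_questions: int) -> int:
    """Total lines passed to generation, assuming one window per sub-question."""
    before, after = window_for(shape)
    return (before + after + 1) * max(1, n_sub_questions)
```

Counting in lines keeps the arithmetic trivial: a window is a pair of integers, clipping is a max and a min, and the budget for a decomposed question is a multiplication.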
Lesson 1 – A relational schema, symmetric to the document side
The literature has “query understanding” and “query rewriting”, but both treat the question as a string turned into another string. Modeling it as a row in question_df plus satellite tables is not how people usually frame it. What makes it click is the symmetry with the document side (line_df, toc_df, span_df): both sides are relational, both join, and retrieval becomes a filter across them.
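The "retrieval becomes a filter" idea can be sketched with lists of dicts standing in for the tables. The table names line_df and toc_df come from the text; their columns (line_id, section_id, title, text) and the retrieve_lines helper are assumptions for this example.

```python
from typing import Dict, List

# Document side (assumed columns): toc_df maps sections to titles,
# line_df holds one record per line of the parsed document.
toc_df: List[Dict] = [
    {"section_id": 1, "title": "Termination"},
    {"section_id": 2, "title": "Payment"},
]
line_df: List[Dict] = [
    {"line_id": 0, "section_id": 1, "text": "Either party may terminate with a notice period of 30 days."},
    {"line_id": 1, "section_id": 1, "text": "The notice period may be waived by the board."},
    {"line_id": 2, "section_id": 2, "text": "Disputes must be raised within a notice period of 10 days."},
    {"line_id": 3, "section_id": 1, "text": "This clause survives expiry."},
]

# Question side: one row with typed columns instead of a raw string.
question: Dict = {"keywords": ["notice period"], "scope": ["Termination"]}


def retrieve_lines(question: Dict, line_df: List[Dict], toc_df: List[Dict]) -> List[Dict]:
    """Join question scope to toc_df, then filter line_df by section and keyword."""
    in_scope = {s["section_id"] for s in toc_df if s["title"] in question["scope"]}
    keywords = [k.lower() for k in question["keywords"]]
    return [
        line for line in line_df
        if line["section_id"] in in_scope
        and any(k in line["text"].lower() for k in keywords)
    ]


hits = retrieve_lines(question, line_df, toc_df)
```

Line 2 mentions the keyword but sits in the Payment section, so the scope join drops it. A pure similarity search over the raw string has no column to express that constraint.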
Why it matters. Most production pipelines store the question as a single string inside the LLM prompt template. There is no notion that the question has a shape, a scope, or a decomposition. When the team needs a new capability (handle negation, handle compound questions, handle ranges), the only place to add it is the prompt template. Six months in, the prompt carries sixty lines of special-case clauses, none of which an audit can trace. Structuring the question once at the parser boundary, the way document parsing structures the document at its boundary, removes that rot at its source.
Lesson 2 – A schema, not branching code
Most RAG codebases grow the question-handling logic as branching code, gated by if intent == "..." chains that ossify over months. We grow the brick as a schema instead: a new capability is a column added to question_df, edited by the expert, not a new code path. The cost of a new feature stays linear in the number of columns, not quadratic in branch combinations.
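One possible reading of "a capability is a column" is sketched below: each column contributes one independent predicate, and the retrieval loop never changes. The registry, the exclude_terms column, and the line record layout are hypothetical names introduced for the example; the article does not specify how columns are wired to retrieval.

```python
from typing import Any, Callable, Dict, List

# One predicate per question column: (column value, document line) -> keep the line?
ColumnFilter = Callable[[Any, Dict], bool]

COLUMN_FILTERS: Dict[str, ColumnFilter] = {
    "keywords": lambda value, line: any(k.lower() in line["text"].lower() for k in value),
    "scope": lambda value, line: line["section"] in value,
}


def retrieve(question: Dict, lines: List[Dict]) -> List[Dict]:
    """Apply every filter whose column is filled in on this question row."""
    active = [
        (COLUMN_FILTERS[col], value)
        for col, value in question.items()
        if col in COLUMN_FILTERS and value
    ]
    return [line for line in lines if all(f(value, line) for f, value in active)]


# New capability (hypothetical negation column): one registry entry.
# retrieve() is untouched, and no existing filter needs to know about it.
COLUMN_FILTERS["exclude_terms"] = lambda value, line: not any(
    t.lower() in line["text"].lower() for t in value
)
```

Because the filters compose with a plain conjunction, the tenth column costs the same as the second. With if intent == "..." chains, each new case has to be reconciled with every branch already there.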