Agentic Retrieval-Augmented Generation for Financial Document Question Answering

Abstract: Financial document question answering (QA) demands complex multi-step numerical reasoning over heterogeneous evidence (structured tables, textual narratives, and footnotes) scattered across corporate filings.

Existing retrieval-augmented generation (RAG) approaches adopt a single-pass retrieve-then-generate paradigm that struggles with the compositional reasoning chains prevalent in financial analysis.

We propose FinAgent-RAG, an agentic RAG framework that orchestrates iterative retrieval-reasoning loops with self-verification, specifically engineered for the precision requirements of financial numerical reasoning.
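To make the idea of an iterative retrieval-reasoning loop with self-verification concrete, the following is a minimal sketch, not the FinAgent-RAG implementation. The function `agentic_answer`, its callback signatures, the `max_rounds` limit, and the toy corpus are all hypothetical names and values introduced only for illustration.

```python
from typing import Callable, List, Optional, Tuple

Passage = Tuple[str, float]  # (label, value): a toy stand-in for a retrieved passage


def agentic_answer(
    question: str,
    retrieve: Callable[[str], List[Passage]],
    reason: Callable[[str, List[Passage]], Optional[float]],
    verify: Callable[[str, List[Passage], Optional[float]], Tuple[bool, Optional[str]]],
    max_rounds: int = 3,
) -> Optional[float]:
    """Retrieve, reason, and verify until the answer is accepted or rounds run out."""
    evidence: List[Passage] = []
    query = question
    answer: Optional[float] = None
    for _ in range(max_rounds):
        evidence.extend(retrieve(query))        # gather more evidence
        answer = reason(question, evidence)     # attempt an answer
        accepted, follow_up = verify(question, evidence, answer)
        if accepted or follow_up is None:       # stop when verified or stuck
            return answer
        query = follow_up                       # otherwise retrieve again
    return answer


# Toy components (made-up data, assumed purely for the example).
CORPUS = {"revenue 2019": 120.0, "revenue 2018": 100.0}


def toy_retrieve(query: str) -> List[Passage]:
    # Top-1 retrieval: return the first entry whose year appears in the query.
    for key, value in CORPUS.items():
        if key.split()[-1] in query:
            return [(key, value)]
    return []


def toy_reason(question: str, evidence: List[Passage]) -> Optional[float]:
    facts = dict(evidence)
    if "revenue 2019" in facts and "revenue 2018" in facts:
        return facts["revenue 2019"] - facts["revenue 2018"]
    return None  # not enough evidence yet


def toy_verify(question, evidence, answer):
    if answer is not None:
        return True, None
    return False, "revenue 2018"  # ask for the missing operand


result = agentic_answer(
    "How did revenue change from 2018 to 2019?",
    toy_retrieve, toy_reason, toy_verify,
)
print(result)
```

In this toy run the first retrieval returns only the 2019 figure, the verifier rejects the empty answer and issues a follow-up query, and the second round completes the computation. A single-pass retrieve-then-generate pipeline would have stopped after the first round.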

The framework integrates three domain-specific innovations: (1) a Contrastive Financial Retriever trained with hard negative mining to distinguish semantically similar but numerically distinct financial passages, (2) a Program-of-Thought reasoning module that generates executable Python code for precise arithmetic rather than relying on error-prone LLM-based mental computation, and (3) an Adaptive Strategy Router that dynamically allocates computational resources based on question complexity, reducing API costs by 41.3% on FinQA while preserving accuracy.
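The Program-of-Thought idea in item (2) can be illustrated with a short sketch: the model emits a small Python program and the interpreter, not the model, performs the arithmetic. The helper `execute_program`, the convention that the program stores its result in a variable named `answer`, and the revenue figures are assumptions made for this example and do not come from the paper.

```python
def execute_program(program: str) -> float:
    """Run a model-generated arithmetic program and return its `answer` variable.

    Builtins are removed so the program can only do plain arithmetic and
    assignments. This is NOT a real security sandbox; a production system
    would need proper isolation.
    """
    namespace = {"__builtins__": {}}
    exec(program, namespace)
    return namespace["answer"]


# Hypothetical program an LLM might emit for the question
# "What was the percentage change in revenue from 2018 to 2019?"
# (the figures are made up for illustration).
generated = """
revenue_2019 = 5735.0
revenue_2018 = 5829.0
answer = (revenue_2019 - revenue_2018) / revenue_2018 * 100
"""

result = execute_program(generated)
print(round(result, 2))  # -1.61
```

Delegating the division to the interpreter keeps the numeric result exact to floating-point precision, whereas a model producing the digits directly may get them wrong.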

Extensive experiments on three benchmark datasets (FinQA, ConvFinQA, and TAT-QA) demonstrate that FinAgent-RAG achieves 76.81%, 78.46%, and 74.96% execution accuracy, respectively, outperforming the strongest baseline by 5.62 to 9.32 percentage points.
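For readers unfamiliar with the metric, execution accuracy is commonly computed as the share of questions whose executed result matches the gold answer. The sketch below assumes numeric answers and a relative tolerance of 1e-4; the function name `execution_accuracy`, the tolerance, and the sample values are illustrative assumptions, and the benchmarks' official evaluation scripts may differ in detail.

```python
import math
from typing import List, Optional


def execution_accuracy(
    predicted: List[Optional[float]],
    gold: List[float],
    rel_tol: float = 1e-4,  # assumed tolerance, not taken from the paper
) -> float:
    """Percentage of questions whose executed result matches the gold answer."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equally long")
    correct = sum(
        p is not None and math.isclose(p, g, rel_tol=rel_tol)
        for p, g in zip(predicted, gold)
    )
    return 100.0 * correct / len(gold)


# Made-up predictions: None marks a program that failed to execute.
score = execution_accuracy([12.5, None, 3.0, 7.1], [12.5, 4.0, 3.0, 7.0])
print(score)  # 50.0
```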

Ablation studies, cross-backbone evaluation with four LLMs, and deployment cost analysis confirm the framework’s robustness and practical viability for financial institutions.