Agentic Retrieval-Augmented Generation for Financial Document Question Answering

Abstract: Financial document question answering (QA) demands complex multi-step numerical reasoning over heterogeneous evidence (structured tables, textual narratives, and footnotes) scattered across corporate filings.

Existing retrieval-augmented generation (RAG) approaches typically adopt a single-pass retrieve-then-generate paradigm that struggles with the compositional reasoning chains prevalent in financial analysis.

We propose FinAgent-RAG, an agentic RAG framework that orchestrates iterative retrieval-reasoning loops with self-verification and is engineered specifically for the precision requirements of financial numerical reasoning.
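To make the idea of an iterative retrieval-reasoning loop with self-verification concrete, the following is a minimal sketch and not the FinAgent-RAG implementation. The function `agentic_answer`, its callable parameters (`retrieve`, `reason`, `verify`), the `max_rounds` budget, and the two-passage toy corpus are all hypothetical names and data introduced here for illustration only.

```python
from typing import Callable, List, Optional, Tuple


def agentic_answer(
    question: str,
    retrieve: Callable[[str, List[str]], List[str]],
    reason: Callable[[str, List[str]], Optional[float]],
    verify: Callable[[str, List[str], float], bool],
    max_rounds: int = 3,
) -> Tuple[Optional[float], int]:
    """Alternate retrieval and reasoning until a candidate passes verification.

    All three callables are hypothetical stand-ins (assumptions, not the
    paper's components). Returns (answer, rounds_used); answer is None if
    no candidate was verified within the round budget.
    """
    evidence: List[str] = []
    for round_idx in range(1, max_rounds + 1):
        # Retrieval step: fetch passages not yet collected.
        for passage in retrieve(question, evidence):
            if passage not in evidence:
                evidence.append(passage)
        # Reasoning step: may return None when evidence is insufficient.
        candidate = reason(question, evidence)
        # Self-verification step: accept only a supported candidate.
        if candidate is not None and verify(question, evidence, candidate):
            return candidate, round_idx
    return None, max_rounds


# Toy components (made-up data, for illustration only).
corpus = ["net income 2020: 50", "net income 2019: 40"]


def retrieve(question: str, seen: List[str]) -> List[str]:
    # Return one unseen passage per round.
    return [p for p in corpus if p not in seen][:1]


def reason(question: str, evidence: List[str]) -> Optional[float]:
    # Map the year (last four characters before the colon) to its value.
    values = {p.split(":")[0][-4:]: float(p.split(":")[1]) for p in evidence}
    if "2019" in values and "2020" in values:
        return values["2020"] - values["2019"]
    return None


def verify(question: str, evidence: List[str], candidate: float) -> bool:
    # Accept only if both operand years are backed by retrieved evidence.
    return all(any(year in p for p in evidence) for year in ("2019", "2020"))


answer, rounds = agentic_answer(
    "Change in net income from 2019 to 2020?", retrieve, reason, verify
)
```

In this toy run the first round retrieves only one of the two required figures, so the loop continues to a second round before a candidate can be produced and verified; a single-pass pipeline with the same retriever would have to answer from incomplete evidence.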

The framework integrates three domain-specific innovations: (1) a Contrastive Financial Retriever trained with hard negative mining to distinguish semantically similar but numerically distinct financial passages, (2) a Program-of-Thought reasoning module that generates executable Python code for precise arithmetic rather than relying on error-prone LLM-based mental computation, and (3) an Adaptive Strategy Router that dynamically allocates computational resources based on question complexity, reducing API costs by 41.3% on FinQA while preserving accuracy.
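The Program-of-Thought idea in (2) can be illustrated with a small sketch: the model emits a short program, and the arithmetic is performed by an interpreter rather than by the model. The code below is an assumption-laden illustration, not the paper's module. The helper `run_program`, the variable names in `generated_program`, and the revenue figures are hypothetical, and a restricted AST evaluator is swapped in for full Python execution so that untrusted model output cannot import modules or call functions.

```python
import ast
import operator

# Whitelisted arithmetic operators for this toy executor (an assumption;
# the paper states only that executable Python code is generated).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def _eval(node, env):
    """Evaluate a numeric expression node using previously assigned names."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name):
        return env[node.id]
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left, env), _eval(node.right, env))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.operand, env))
    raise ValueError(f"unsupported expression: {ast.dump(node)}")


def run_program(program: str) -> float:
    """Run a straight-line program of `name = expr` lines; return `answer`."""
    env = {}
    for stmt in ast.parse(program).body:
        is_simple_assign = (
            isinstance(stmt, ast.Assign)
            and len(stmt.targets) == 1
            and isinstance(stmt.targets[0], ast.Name)
        )
        if not is_simple_assign:
            raise ValueError("only simple assignments are allowed")
        env[stmt.targets[0].id] = _eval(stmt.value, env)
    return env["answer"]


# Hypothetical model output for "What was the percentage change in revenue?"
# The figures are made up for illustration.
generated_program = """
revenue_2019 = 1200.0
revenue_2020 = 1500.0
answer = (revenue_2020 - revenue_2019) / revenue_2019 * 100
"""

result = run_program(generated_program)  # percentage change, computed exactly
```

Delegating the final computation to an interpreter means the model only has to select the right operands and operations; the numeric result is then deterministic and reproducible from the emitted program.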

Extensive experiments on three benchmark datasets (FinQA, ConvFinQA, and TAT-QA) demonstrate that FinAgent-RAG achieves 76.81%, 78.46%, and 74.96% execution accuracy, respectively, outperforming the strongest baseline by 5.62 to 9.32 percentage points.

Ablation studies, cross-backbone evaluation with four LLMs, and deployment cost analysis confirm the framework’s robustness and practical viability for financial institutions.