Improving the Completeness and Comparability of Segment Disclosures: A Large Language Model Approach

Abstract: Segment-level disclosures are a central component of financial reporting, providing insight into firms’ internal organization and the allocation of economic activities across operating units. However, segment information is often presented in both qualitative and quantitative forms and is dispersed across the tables and narrative sections of Form 10-K filings.

Empirical research relying on structured databases faces both completeness and comparability challenges, as some firm-year observations may be missing, nested segment disclosures are not captured, and support for longitudinal and cross-firm comparability is limited.

This study develops a large language model (LLM)-based framework to extract segment disclosures directly from Form 10-K filings and to preserve both reportable and nested segment information. We further design a retrieval-augmented system that incorporates information across multiple filings to support comparability.

We use two representative settings to demonstrate the framework's application: longitudinal analysis within a firm to interpret segment changes over time, and alignment of geographic segments across firms with different reporting structures.

The results indicate that the artifact accurately extracts segment-level information and effectively addresses questions that require cross-period knowledge, demonstrating the potential of LLM-based approaches to enhance the measurement and interpretation of segment disclosures.