Information Extraction from Electricity Invoices with General-Purpose Large Language Models

Abstract: Information extraction from semi-structured business documents remains a critical challenge for enterprise management. This study evaluates the capability of general-purpose Large Language Models (LLMs) to extract structured information from Spanish electricity invoices without task-specific fine-tuning.

Using a subset of the IDSEM dataset, we benchmark two architecturally distinct models, Gemini 1.5 Pro and Mistral-small, across 19 parameter configurations and 6 prompting strategies. Our experimental framework treats prompt engineering as the primary experimental variable, comparing zero-shot baselines against increasingly sophisticated few-shot approaches and iterative extraction strategies.
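
To make the contrast between zero-shot and few-shot prompting concrete, the following is a minimal sketch of how such prompts might be assembled. It is not the prompt used in the study: the function name `build_prompt`, the instruction wording, and the field names are all hypothetical, and no model API is called.

```python
import json


def build_prompt(invoice_text, fields, examples=None):
    """Assemble an extraction prompt (hypothetical wording, for illustration).

    `examples` is an optional list of (invoice_text, expected_dict) pairs.
    With no examples the prompt is zero-shot; with examples it is few-shot.
    """
    lines = [
        "Extract the following fields from the electricity invoice below.",
        "Return a single JSON object with exactly these keys: " + ", ".join(fields) + ".",
        "Use null for any field that does not appear in the invoice.",
    ]
    # Each worked example shows the model an input and its expected JSON output.
    for example_text, example_output in (examples or []):
        lines += [
            "",
            "Invoice:",
            example_text,
            "JSON:",
            json.dumps(example_output, ensure_ascii=False),
        ]
    # The target invoice comes last, leaving the JSON answer for the model to fill in.
    lines += ["", "Invoice:", invoice_text, "JSON:"]
    return "\n".join(lines)


# Hypothetical field names; the real IDSEM label set is not reproduced here.
fields = ["invoice_number", "total_amount"]
zero_shot = build_prompt("Factura N. A-123 ... Total: 84,12 EUR", fields)
few_shot = build_prompt(
    "Factura N. A-123 ... Total: 84,12 EUR",
    fields,
    examples=[("Factura N. B-1 ... Total: 10,00 EUR",
               {"invoice_number": "B-1", "total_amount": "10.00"})],
)
```

The only difference between the two strategies in this sketch is the list of worked examples placed before the target invoice, which is the variable the study manipulates.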

Results demonstrate that prompt quality matters far more than hyperparameter tuning: the F1-score variation across all parameter configurations is marginal, while the gap between the zero-shot baseline and the best few-shot strategy exceeds 19 percentage points. The best configuration (few-shot with cross-validation) achieves an F1-score of 97.61% for Gemini and 96.11% for Mistral-small, with document template structure emerging as the primary determinant of extraction difficulty.
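
For readers unfamiliar with the metric, the sketch below shows one common way a field-level F1-score could be computed for a single invoice. The abstract does not specify the scoring protocol, so the exact-match rule, the function name `field_f1`, and the field names are assumptions made for illustration.

```python
def field_f1(predicted, expected):
    """Return (precision, recall, f1) for one invoice under exact string matching.

    Assumed convention: a wrong value counts as both a false positive and a
    false negative; a missing field counts as a false negative only.
    Values in `expected` are assumed to be non-None.
    """
    # True positives: predicted fields whose value matches the ground truth.
    tp = sum(1 for k, v in predicted.items()
             if v is not None and expected.get(k) == v)
    # False positives: predicted fields that are wrong or not in the ground truth.
    fp = sum(1 for k, v in predicted.items()
             if v is not None and expected.get(k) != v)
    # False negatives: ground-truth fields that were missed or extracted wrongly.
    fn = sum(1 for k, v in expected.items() if predicted.get(k) != v)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, recall, f1


# Hypothetical ground truth and model output for one invoice.
expected = {"invoice_number": "A-123", "total_amount": "84.12", "billing_period": "2023-01"}
predicted = {"invoice_number": "A-123", "total_amount": "84.21"}

precision, recall, f1 = field_f1(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.50 recall=0.33 f1=0.40
```

Here one of two predicted values is correct (precision 0.50) and one of three expected fields is recovered (recall 0.33), giving an F1 of 0.40. A corpus-level score would aggregate these counts over all invoices.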

These findings indicate that prompt design is the critical lever for maximizing extraction fidelity in LLM-based document processing, and they provide an empirical framework for integrating general-purpose LLMs into business document automation.
