Benchmarking Large Language Models for Safety Data Extraction
Abstract: Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks state-of-the-art Large Language Models (LLMs) for automated SDS data extraction, comparing text-based and multimodal processing pipelines.
We systematically evaluate four models (Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, and Llama 3.1-70B) across three prompting strategies: zero-shot, few-shot, and chain-of-thought. The evaluation framework assesses accuracy, latency, and cost across more than 50,000 extracted data fields.
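As a rough illustration of the evaluation design described above, the following sketch enumerates the model and prompting-strategy grid and scores one extraction at the field level. The scoring rule (exact match after lowercasing and whitespace normalization) and the field names in the usage note are assumptions made for illustration; the abstract does not specify how accuracy was computed.

```python
from itertools import product

# Models and prompting strategies named in the abstract.
MODELS = ["Gemini 1.5 Pro", "GPT-4o", "Claude 3.7 Sonnet", "Llama 3.1-70B"]
STRATEGIES = ["zero-shot", "few-shot", "chain-of-thought"]


def evaluation_grid():
    """Return every (model, strategy) pair to be benchmarked."""
    return list(product(MODELS, STRATEGIES))


def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting
    differences are not counted as extraction errors (assumed rule)."""
    return " ".join(str(value).lower().split())


def field_accuracy(predicted, reference):
    """Fraction of reference fields whose predicted value matches
    after normalization. Missing fields count as errors."""
    if not reference:
        return 0.0
    correct = sum(
        1
        for key, truth in reference.items()
        if key in predicted and normalize(predicted[key]) == normalize(truth)
    )
    return correct / len(reference)
```

For example, with a hypothetical three-field reference record (`cas_number`, `product_name`, `flash_point`) and a prediction that gets the first two right and omits the third, `field_accuracy` returns 2/3. Averaging this score over all documents for each of the 12 grid cells would give one accuracy figure per model and strategy combination.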
Results show that text-based extraction consistently outperforms multimodal processing across all metrics. Gemini 1.5 Pro combined with a chain-of-thought prompt achieved the highest accuracy (84%), ahead of GPT-4o (81%) and Claude 3.7 Sonnet (79%). However, no model surpassed the 90% accuracy threshold commonly required for reliable real-world deployment.
These findings indicate that general-purpose LLMs are not yet robust enough for unsupervised industrial use, although their performance suggests strong potential with task-specific fine-tuning. Future research should focus on domain-adapted training, model calibration, and the integration of human-in-the-loop verification to ensure safety-critical reliability.