Benchmarking Large Language Models for Safety Data Extraction

Abstract: Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks state-of-the-art Large Language Models (LLMs) for automated SDS data extraction, comparing text-based and multimodal processing pipelines.

We systematically evaluate four models (Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, and Llama 3.1-70B) under three prompting strategies: zero-shot, few-shot, and chain-of-thought. The evaluation framework assesses accuracy, latency, and cost across more than 50,000 extracted data fields.
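As an illustration of what field-level accuracy can mean in this setting, the following minimal sketch scores one extracted record against a reference record. The scoring rule (exact match after whitespace and case normalization), the field names, and the helper names `normalize` and `field_accuracy` are assumptions made for this example; the abstract does not specify how the study scored individual fields.

```python
from typing import Mapping


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(value.lower().split())


def field_accuracy(predicted: Mapping[str, str], reference: Mapping[str, str]) -> float:
    """Return the fraction of reference fields whose predicted value matches.

    Assumption: a field counts as correct only on an exact match after
    normalization; a missing field counts as wrong.
    """
    if not reference:
        raise ValueError("reference must contain at least one field")
    correct = sum(
        1
        for name, expected in reference.items()
        if normalize(predicted.get(name, "")) == normalize(expected)
    )
    return correct / len(reference)


# Hypothetical SDS record with four fields (field names are illustrative).
reference = {
    "product_name": "Acetone",
    "cas_number": "67-64-1",
    "signal_word": "Danger",
    "flash_point": "-20 °C",
}
# Hypothetical model output: one wrong value and one missing field.
predicted = {
    "product_name": "acetone",
    "cas_number": "67-64-1",
    "signal_word": "Warning",
}

accuracy = field_accuracy(predicted, reference)
print(f"field-level accuracy: {accuracy:.2f}")
```

Averaging this per-field score over every field in a corpus gives one aggregate accuracy figure; a stricter or looser matching rule (for example, numeric tolerance on flash points) would change the reported number.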

Results show that text-based extraction consistently outperforms multimodal processing across all metrics. Gemini 1.5 Pro with a chain-of-thought prompt achieved the highest accuracy (84%), ahead of GPT-4o (81%) and Claude 3.7 Sonnet (79%). However, no model surpassed the 90% accuracy threshold commonly required for reliable real-world deployment.

These findings indicate that general-purpose LLMs are not yet robust enough for unsupervised industrial use, although their performance suggests strong potential once they are fine-tuned for the task. Future research should focus on domain-adapted training, model calibration, and the integration of human-in-the-loop verification to ensure safety-critical reliability.
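One way human-in-the-loop verification could be combined with calibration is to route low-confidence fields to a reviewer. The sketch below is a hypothetical illustration, not a method described in this study: the `ExtractedField` type, the `route_for_review` helper, the availability of a calibrated per-field confidence score, and the 0.9 cutoff are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to be a calibrated probability in [0, 1]


def route_for_review(
    fields: Iterable[ExtractedField], threshold: float = 0.9
) -> Tuple[List[ExtractedField], List[ExtractedField]]:
    """Split fields into (auto-accepted, sent to a human reviewer).

    Assumption: the threshold is an illustrative value; in practice it would
    be tuned on held-out data once the confidence scores are calibrated.
    """
    accepted: List[ExtractedField] = []
    needs_review: List[ExtractedField] = []
    for field in fields:
        if field.confidence >= threshold:
            accepted.append(field)
        else:
            needs_review.append(field)
    return accepted, needs_review


# Hypothetical extraction output for a single SDS.
fields = [
    ExtractedField("cas_number", "67-64-1", 0.98),
    ExtractedField("signal_word", "Warning", 0.62),
    ExtractedField("flash_point", "-20 °C", 0.91),
]

accepted, needs_review = route_for_review(fields)
print([f.name for f in accepted])
print([f.name for f in needs_review])
```

The routing is only as trustworthy as the confidence scores behind it, which is why calibration and human verification are natural companions: a poorly calibrated model would send the wrong fields to the reviewer.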
