Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production

Abstract: Academic research on document understanding tends to focus on new models, leaving a wide gap in the literature between defining a model and running it at production scale. To close that gap, we present a microservice architecture that encapsulates a multi-model pipeline for classification, optical character recognition (OCR), and structured field extraction with a large language model (LLM). We also report our experience running this pipeline on thousands of multi-page documents per hour.

We describe our primary design decisions: a hybrid classification approach, separation of GPU-bound inference from CPU-bound orchestration, asynchronous processing for the many IO-bound operations in the pipeline, and a strategy for scaling each service horizontally and independently. Batch profiling revealed two surprising qualitative findings that shape production deployments. First, OCR, not language-model parsing, dominates end-to-end latency. Second, the system saturates at a concurrency level set by shared GPU-inference capacity rather than by worker count.
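The second finding can be illustrated with a minimal asyncio sketch. It is a toy simulation under stated assumptions, not the system described here: the names `run_batch`, `process_document`, and `ocr_page`, the sleep durations, and the use of a semaphore to model shared GPU-inference capacity are all hypothetical. It shows only that when orchestration is asynchronous and inference capacity is shared, the number of OCR calls in flight is capped by GPU capacity, however many workers are admitted.

```python
import asyncio


async def ocr_page(gpu_slots, stats, delay):
    """Simulate one OCR call against a shared, capacity-limited GPU service."""
    async with gpu_slots:  # wait for a free inference slot
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(delay)  # stand-in for GPU-bound OCR inference
        stats["in_flight"] -= 1


async def process_document(doc_id, pages, gpu_slots, stats):
    """Orchestrate one document: OCR every page, then extract fields."""
    # Pages are submitted concurrently; the orchestrator only awaits IO.
    await asyncio.gather(*(ocr_page(gpu_slots, stats, 0.01) for _ in range(pages)))
    await asyncio.sleep(0.001)  # stand-in for the (cheaper) LLM extraction call
    return doc_id


async def run_batch(num_docs, pages_per_doc, worker_count, gpu_capacity):
    """Run a batch and report the peak number of concurrent OCR calls."""
    gpu_slots = asyncio.Semaphore(gpu_capacity)    # assumed shared GPU capacity
    worker_slots = asyncio.Semaphore(worker_count)  # assumed orchestration workers
    stats = {"in_flight": 0, "peak": 0}

    async def worker(doc_id):
        async with worker_slots:
            return await process_document(doc_id, pages_per_doc, gpu_slots, stats)

    done = await asyncio.gather(*(worker(i) for i in range(num_docs)))
    return done, stats["peak"]


if __name__ == "__main__":
    done, peak = asyncio.run(
        run_batch(num_docs=8, pages_per_doc=3, worker_count=8, gpu_capacity=2)
    )
    print(f"processed {len(done)} documents, peak concurrent OCR calls: {peak}")
```

In this sketch, raising `worker_count` above `gpu_capacity` does not increase the peak number of concurrent OCR calls; only raising `gpu_capacity` does. That is the qualitative behavior the profiling result describes.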

Our goal is to provide practitioners with concrete architectural patterns for building document understanding systems that work beyond the benchmark and operationalize models effectively in production.
