Office Comprehension Benchmark

Abstract: We introduce Office Comprehension Bench (OCB), the first public benchmark to jointly evaluate LLM systems on Word, Excel, and PowerPoint comprehension over native file formats (.docx, .xlsx, .pptx) and their variants.

OCB consists of two tracks. File Fidelity Q&A tests structural and visual perception of office artifacts: tables, charts, embedded images, formulas, and app-specific elements such as headers, speaker notes, and named ranges.

Domain Q&A tests expert-level reasoning grounded in real-world industry documents across 12 professional domains, with queries requiring multi-step analysis and synthesis across documents.

Each reference answer is decomposed into atomic, binary-gradable claims, and an ensemble of LLM judges scores responses against each claim independently.
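
The claim-level scoring described above can be illustrated with a minimal sketch. It is not the released evaluation tooling: the `ClaimVerdicts` structure, the function names, and the example claims are hypothetical, and the aggregation rule (strict majority vote per claim, then the fraction of claims passed) is an assumption, since the abstract does not say how judge verdicts are combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClaimVerdicts:
    """Binary verdicts from each judge for one atomic claim (hypothetical structure)."""
    claim: str
    votes: List[bool]  # one entry per judge: True if the response satisfies the claim


def claim_passes(votes: List[bool]) -> bool:
    # Assumption: strict majority vote across judges; a tie counts as a failure.
    return sum(votes) * 2 > len(votes)


def response_score(verdicts: List[ClaimVerdicts]) -> float:
    # Fraction of atomic claims accepted by the judge ensemble.
    if not verdicts:
        raise ValueError("a reference answer needs at least one claim")
    passed = sum(claim_passes(v.votes) for v in verdicts)
    return passed / len(verdicts)


# Illustrative input only: four made-up claims, each graded by three judges.
example = [
    ClaimVerdicts("identifies the correct worksheet", [True, True, True]),
    ClaimVerdicts("reports the year-over-year change", [True, True, False]),
    ClaimVerdicts("cites the speaker note on slide 4", [False, False, True]),
    ClaimVerdicts("reconciles totals across both files", [True, False, False]),
]
score = response_score(example)
```

Because each claim is graded independently, a response can earn partial credit for a multi-step answer, and disagreement between judges affects only the claim in question.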

Even the strongest frontier system in its default reasoning mode reaches only about 59.3% on Domain Q&A. Increasing thinking depth within a product tier does not materially improve performance, while moving to a higher product tier yields modest gains.

We release the dataset, evaluation tooling, judge prompt, and a public leaderboard.
