Microsoft / MarkItDown

Important: MarkItDown performs I/O with the privileges of the current process. Like open() or requests.get(), it will access any resource that the process itself can access. Sanitize your inputs in untrusted environments, and call the narrowest convert_* function that fits your use case (e.g., convert_stream() or convert_local()). See the Security Considerations section of the documentation for more information.
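One way to sanitize a user-supplied file name before conversion is to confirm that it resolves inside a directory you control, and only then pass it to convert_local(). The following is a minimal sketch under stated assumptions, not MarkItDown's own security mechanism: the helper names resolve_inside and convert_untrusted and the directory layout are hypothetical, and the check covers path traversal only.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir; raise ValueError if it escapes base_dir."""
    base = Path(base_dir).resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"{user_path!r} resolves outside {base}")
    return candidate


def convert_untrusted(base_dir: str, user_path: str) -> str:
    """Convert a local file only after the path check above has passed."""
    # Imported lazily so the path check can be used without markitdown installed.
    from markitdown import MarkItDown

    safe_path = resolve_inside(base_dir, user_path)
    # convert_local() is the narrow, local-files-only entry point named above.
    result = MarkItDown().convert_local(str(safe_path))
    return result.text_content
```

Calling convert_untrusted("uploads", "../secret.txt") raises ValueError before MarkItDown touches the file system.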

MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. In this respect it is most comparable to textract, but with a focus on preserving important document structure and content as Markdown (including headings, lists, tables, and links). While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools, and may not be the best option for high-fidelity document conversions for human consumption.

MarkItDown currently supports conversion from: PDF, PowerPoint, Word, Excel, images (EXIF metadata and OCR), audio (EXIF metadata and speech transcription), HTML, text-based formats (CSV, JSON, XML), ZIP files (iterates over contents), YouTube URLs, EPubs, and more.

Why Markdown? Markdown is extremely close to plain text, with minimal markup or formatting, but still provides a way to represent important document structure. Mainstream LLMs, such as OpenAI’s GPT-4o, natively “speak” Markdown, and often incorporate Markdown into their responses unprompted. This suggests that they have been trained on vast amounts of Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions are also highly token-efficient.

Prerequisites

MarkItDown requires Python 3.10 or higher. It is recommended to use a virtual environment to avoid dependency conflicts.

With the standard Python installation, you can create and activate a virtual environment using the following commands:

python -m venv .venv
source .venv/bin/activate

If using uv, you can create a virtual environment with:

uv venv --python=3.12 .venv
source .venv/bin/activate
# NOTE: Be sure to use 'uv pip install' rather than just 'pip install' to install packages in this virtual environment

If you are using Anaconda, you can create a virtual environment with:

conda create -n markitdown python=3.12
conda activate markitdown

Installation

To install MarkItDown, use pip:

pip install 'markitdown[all]'

Alternatively, you can install it from source:

git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'

Usage

Command-Line:

markitdown path-to-file.pdf > document.md

Or use -o to specify the output file:

markitdown path-to-file.pdf -o document.md

You can also pipe content:

cat path-to-file.pdf | markitdown
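For batch jobs, the command-line form with -o can be scripted. The following is a minimal sketch rather than a feature of MarkItDown: the helper names build_command and convert_directory are hypothetical, and it assumes the markitdown executable is on PATH. It relies only on the -o flag described above.

```python
import subprocess
from pathlib import Path


def build_command(source: Path, output: Path) -> list[str]:
    """Build the argument list for: markitdown <source> -o <output>."""
    return ["markitdown", str(source), "-o", str(output)]


def convert_directory(input_dir: str, output_dir: str, pattern: str = "*.pdf") -> list[Path]:
    """Run the markitdown CLI once per matching file and return the output paths."""
    out_root = Path(output_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(Path(input_dir).glob(pattern)):
        output = out_root / (source.stem + ".md")
        # check=True raises CalledProcessError if the CLI exits with a non-zero status.
        subprocess.run(build_command(source, output), check=True)
        written.append(output)
    return written
```

Passing the arguments as a list, rather than a shell string, avoids shell interpretation of file names that contain spaces or special characters.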

Optional Dependencies

MarkItDown has optional dependencies for activating various file formats. Earlier in this document, we installed all optional dependencies with the [all] option. However, you can also install them individually for more control. For example, the following installs only the dependencies for PDF, DOCX, and PPTX files:

pip install 'markitdown[pdf, docx, pptx]'

At the moment, the following optional dependencies are available:

  • [all] Installs all optional dependencies
  • [pptx] Installs dependencies for PowerPoint files
  • [docx] Installs dependencies for Word files
  • [xlsx] Installs dependencies for Excel files
  • [xls] Installs dependencies for older Excel files
  • [pdf] Installs dependencies for PDF files
  • [outlook] Installs dependencies for Outlook messages
  • [az-doc-intel] Installs dependencies for Azure Document Intelligence
  • [az-content-understanding] Installs dependencies for Azure Content Understanding
  • [audio-transcription] Installs dependencies for audio transcription of wav and mp3 files
  • [youtube-transcription] Installs dependencies for fetching YouTube video transcription
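If install commands are generated by a script, the extras listed above can be checked before pip is called. This is a minimal sketch: the helper name pip_requirement is hypothetical, and KNOWN_EXTRAS is copied by hand from the list above, so it may fall out of date as MarkItDown adds or removes extras.

```python
# Assumption: copied by hand from the list of optional dependencies above.
KNOWN_EXTRAS = {
    "all", "pptx", "docx", "xlsx", "xls", "pdf", "outlook",
    "az-doc-intel", "az-content-understanding",
    "audio-transcription", "youtube-transcription",
}


def pip_requirement(extras: list[str]) -> str:
    """Return a pip requirement string such as 'markitdown[pdf,docx]'."""
    unknown = sorted(set(extras) - KNOWN_EXTRAS)
    if unknown:
        raise ValueError(f"unknown extras: {', '.join(unknown)}")
    # dict.fromkeys() drops duplicates while keeping the caller's order.
    unique = list(dict.fromkeys(extras))
    if not unique:
        return "markitdown"
    return f"markitdown[{','.join(unique)}]"
```

The returned string can be passed as a single argument to pip install, for example pip install 'markitdown[pdf,docx,pptx]'.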

Plugins

MarkItDown also supports 3rd-party plugins. Plugins are disabled by default. To list installed plugins:

markitdown --list-plugins

To enable plugins, use:

markitdown --use-plugins path-to-file.pdf

To find available plugins, search GitHub for the hashtag #markitdown-plugin. To develop a plugin, see packages/markitdown-sample-plugin.

markitdown-ocr Plugin

The markitdown-ocr plugin adds OCR support to the PDF, DOCX, PPTX, and XLSX converters, extracting text from embedded images using LLM Vision. It follows the same llm_client / llm_model pattern that MarkItDown already uses for image descriptions. No new ML libraries or binary dependencies are required.

Installation:

pip install markitdown-ocr
pip install openai # or any OpenAI-compatible client

Usage:

from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)
result = md.convert("document_with_images.pdf")
print(result.text_content)

If no llm_client is provided, the plugin still loads, but OCR is silently skipped and the standard built-in converter is used instead. See packages/markitdown-ocr/README.md for detailed documentation.
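Because a missing llm_client does not raise an error, it can help to make the fallback visible in your own code. The following is a minimal sketch, assuming you build the constructor arguments yourself. The helper name ocr_kwargs is hypothetical, and the warning is emitted by this helper, not by MarkItDown or the plugin.

```python
import warnings


def ocr_kwargs(llm_client=None, llm_model: str = "gpt-4o") -> dict:
    """Build keyword arguments for MarkItDown(...), warning if OCR will be skipped."""
    kwargs = {"enable_plugins": True}
    if llm_client is None:
        # The plugin would load but skip OCR; say so instead of failing silently.
        warnings.warn("No llm_client given: markitdown-ocr will skip OCR.")
    else:
        kwargs["llm_client"] = llm_client
        kwargs["llm_model"] = llm_model
    return kwargs
```

With the openai package installed, the result can be unpacked into the constructor as MarkItDown(**ocr_kwargs(OpenAI())).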

Azure Content Understanding

Azure Content Understanding provides higher-quality conversion with structured field extraction (YAML front matter), multi-modal support (documents, images, audio, video), and configurable analyzers. Install:

pip install 'markitdown[az-content-understanding]'

When to use Content Understanding

Content Understanding is ideal when you need capabilities beyond what the built-in or Document Intelligence converters provide:

  • Audio and video files: Content Understanding (CU) is the only option for video, and the higher-quality cloud option for audio. Built-in converters have no video support and only basic audio transcription.
  • Structured field extraction: Prebuilt or custom-built analyzers extract domain-specific fields (invoice amounts, receipt dates, contract clauses) serialized as YAML front matter. Neither the built-in converters nor the Document Intelligence integration exposes fields.
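The two criteria above can be expressed as a small routing rule in application code. This is an illustrative sketch only: the helper name suggest_converter, the returned labels, and the set of video extensions are all hypothetical and are not defined by MarkItDown.

```python
from pathlib import Path

# Assumption: an illustrative set of video extensions, not a list from MarkItDown.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv"}


def suggest_converter(path: str, need_fields: bool = False) -> str:
    """Suggest 'content-understanding' or 'built-in' for a given input file."""
    suffix = Path(path).suffix.lower()
    if suffix in VIDEO_EXTENSIONS:
        # Built-in converters have no video support.
        return "content-understanding"
    if need_fields:
        # Structured fields are only exposed through Content Understanding.
        return "content-understanding"
    return "built-in"
```

For example, suggest_converter("invoice.pdf", need_fields=True) returns "content-understanding", while a plain suggest_converter("notes.docx") returns "built-in".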
