run-llama / liteparse
LiteParse is a standalone OSS PDF parsing tool focused exclusively on fast and light parsing. It provides high-quality spatial text parsing with bounding boxes, without proprietary LLM features or cloud dependencies. Everything runs locally on your machine.
Hitting the limits of local parsing? For complex documents (dense tables, multi-column layouts, charts, handwritten text, or scanned PDFs), you’ll get significantly better results with LlamaParse, our cloud-based document parser built for production document pipelines. LlamaParse handles the hard stuff so your models see clean, structured data and markdown.
Overview
- Fast Text Parsing: Spatial text parsing using PDFium
- Flexible OCR System:
  - Built-in: Tesseract (zero setup, bundled with the library)
  - HTTP Servers: Plug in any OCR server (EasyOCR, PaddleOCR, custom)
  - Standard API: Simple, well-defined OCR API specification
- Screenshot Generation: Generate high-quality page screenshots for LLM agents
- Multiple Output Formats: JSON and text
- Bounding Boxes: Precise text positioning information
- Multi-language: Use from Rust, Node.js/TypeScript, Python, or the browser (WASM)
- Multi-platform: Linux, macOS (Intel/ARM), Windows
Installation
Install via your preferred package manager. All versions (except WASM) ship with the same lit CLI.
| Language | Install Library | Docs |
|---|---|---|
| Node.js / TypeScript | npm i @llamaindex/liteparse | Node.js README |
| Python | pip install liteparse | Python README |
| Rust | cargo install liteparse (CLI) / cargo add liteparse (lib) | Rust README |
| Browser (WASM) | npm i @llamaindex/liteparse-wasm | WASM README |
CLI Usage
The CLI is the same across all installations (npm, pip, cargo install).
Parse Files
# Basic parsing
lit parse document.pdf
# Parse with specific format
lit parse document.pdf --format json -o output.json
# Parse specific pages
lit parse document.pdf --target-pages "1-5,10,15-20"
# Parse without OCR
lit parse document.pdf --no-ocr
# Parse a remote PDF
curl -sL https://example.com/report.pdf | lit parse -
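When page selections come from another program, it can help to build the `--target-pages` value rather than type it by hand. The following is a minimal Python sketch, not part of LiteParse itself: `to_target_pages` and `parse_command` are hypothetical helper names, and the sketch assumes page numbers are 1-based and that the flag accepts exactly the comma-and-dash syntax shown in the example above.

```python
from typing import Iterable, List


def to_target_pages(pages: Iterable[int]) -> str:
    """Collapse page numbers into a range string such as "1-5,10,15-20"."""
    ordered = sorted(set(pages))  # drop duplicates and sort ascending
    if not ordered:
        raise ValueError("at least one page number is required")
    if ordered[0] < 1:
        # Assumption: pages are 1-based, as the CLI examples suggest.
        raise ValueError("page numbers start at 1")

    parts: List[str] = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:
            prev = page  # still inside the current consecutive run
            continue
        # The run ended: emit "N" for a single page, "A-B" for a range.
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = page
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def parse_command(pdf_path: str, pages: Iterable[int]) -> List[str]:
    """Build the argv for `lit parse` restricted to the given pages."""
    return ["lit", "parse", pdf_path, "--target-pages", to_target_pages(pages)]


if __name__ == "__main__":
    print(to_target_pages([1, 2, 3, 4, 5, 10, 15, 16, 17, 18, 19, 20]))
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting issues with the comma-separated value.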
Batch Parsing
Parse an entire directory of documents:
lit batch-parse ./input-directory ./output-directory
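For scripted pipelines, the same command can be driven from Python with a few sanity checks in front of it. This is a rough sketch under stated assumptions: `batch_parse_command` and `run_batch_parse` are hypothetical wrappers around the CLI, not a LiteParse API. Creating the output directory beforehand is a precaution, since the text above does not say whether `lit batch-parse` creates it.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def batch_parse_command(input_dir: Path, output_dir: Path) -> List[str]:
    """Build the argv for `lit batch-parse <input> <output>`."""
    return ["lit", "batch-parse", str(input_dir), str(output_dir)]


def run_batch_parse(input_dir: Path, output_dir: Path) -> int:
    """Run the batch parse and return the CLI's exit code."""
    if not input_dir.is_dir():
        raise NotADirectoryError(str(input_dir))
    if shutil.which("lit") is None:
        raise FileNotFoundError("`lit` was not found on PATH")
    # Assumption: pre-creating the output directory is harmless to the CLI.
    output_dir.mkdir(parents=True, exist_ok=True)
    completed = subprocess.run(batch_parse_command(input_dir, output_dir))
    return completed.returncode


if __name__ == "__main__":
    code = run_batch_parse(Path("./input-directory"), Path("./output-directory"))
    print(f"lit batch-parse exited with code {code}")
```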
Generate Screenshots
Screenshots are essential for LLM agents to extract visual information that text alone cannot capture.
# Screenshot all pages
lit screenshot document.pdf -o ./screenshots
# Screenshot specific pages
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots
# Custom DPI
lit screenshot document.pdf --dpi 300 -o ./screenshots
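To pick a `--dpi` value, it helps to estimate how large the resulting image will be. PDF page sizes are expressed in points, with 72 points per inch by default, so a page rendered at a given DPI is scaled by `dpi / 72`. The sketch below applies that arithmetic. It assumes `lit screenshot` follows this usual convention, so actual output may differ by a pixel due to rounding; `rendered_size` and `raw_rgba_bytes` are hypothetical helper names.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # default PDF user space unit: 1 point = 1/72 inch


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def raw_rgba_bytes(width_px: int, height_px: int) -> int:
    """Uncompressed size of the bitmap at 4 bytes per pixel (RGBA)."""
    return width_px * height_px * 4


if __name__ == "__main__":
    # US Letter is 612 x 792 points (8.5 x 11 inches).
    for dpi in (72, 150, 300):
        w, h = rendered_size(612, 792, dpi)
        print(f"{dpi} DPI: {w} x {h} px, about {raw_rgba_bytes(w, h) / 1e6:.1f} MB uncompressed")
```

Doubling the DPI quadruples the pixel count, so higher settings trade sharper text for larger images and more memory per page.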