Toolkit for linearizing PDFs for LLM datasets/training

olmocr

This project uses vision-language models (VLMs) to parse and linearize complex PDF documents, converting unstructured content (multi-column text, tables, embedded images, and mixed font styles and layouts) into a continuous, structured text representation. It supports the full pipeline for distributed, multi-node parsing of millions of PDF documents, with the goal of building high-quality datasets for large language models (LLMs).

allenai/olmocr
