dots.ocr: New 1.7B Open-Source VLM Achieves SOTA Multilingual Document Parsing
A new open-source vision-language transformer model, dots.ocr, is poised to redefine multilingual document parsing and optical character recognition (OCR). Developed to streamline the complex process of digital document analysis, dots.ocr integrates both layout detection and content recognition within a single, unified architecture, offering a comprehensive solution for processing a vast array of structured and unstructured documents across more than 100 languages.
At its core, dots.ocr is a transformer-based vision-language model. What sets it apart is its ability to perform document layout understanding and text extraction simultaneously, eliminating the need for separate, often cumbersome, detection and OCR pipelines. This unified approach not only simplifies the workflow but also lets users switch the model’s task simply by changing the input prompt. With 1.7 billion parameters, the model strikes a balance between computational efficiency and robust performance, making it suitable for a wide range of practical applications. Its flexibility extends to input types, accommodating both image files and PDF documents, and includes preprocessing options such as fitz_preprocess to improve quality on low-resolution or dense multi-page files.
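The article does not spell out what fitz_preprocess does, but the name points to PyMuPDF (imported in Python as fitz), which can re-render PDF pages at a higher DPI before OCR. Below is a minimal sketch of that kind of preprocessing, assuming a standard PyMuPDF installation; the 200 DPI target and function name are illustrative, not the project’s actual implementation:

```python
import fitz  # PyMuPDF, the library the "fitz_preprocess" option's name suggests

def render_pdf_pages(path: str, dpi: int = 200) -> list[str]:
    """Re-render each PDF page as a raster image at the given DPI.

    Low-resolution or dense multi-page PDFs benefit from re-rendering at a
    higher resolution before OCR; dots.ocr's internal strategy may differ
    (this is only an illustrative sketch).
    """
    doc = fitz.open(path)
    zoom = dpi / 72  # PDF user space is 72 points per inch
    matrix = fitz.Matrix(zoom, zoom)
    images = []
    for page in doc:
        pix = page.get_pixmap(matrix=matrix)
        out = f"page_{page.number}.png"
        pix.save(out)
        images.append(out)
    doc.close()
    return images

pages = render_pdf_pages("report.pdf", dpi=200)
```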
The model’s capabilities are notably broad, starting with its extensive multilingual support. Trained on diverse datasets, dots.ocr handles over 100 languages, covering major world languages as well as less common scripts. Beyond plain text extraction, the model pulls out tabular data and mathematical formulas, rendering the latter in LaTeX. Crucially, it preserves the original reading order and document structure, including table boundaries, formula regions, and image placements. This ensures the extracted data remains faithful to the source, delivered in structured formats such as JSON, Markdown, or HTML, depending on the content and layout.
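The article does not reproduce the exact output schema, but the description implies a list of layout elements carrying bounding boxes, categories, and content in reading order. The sketch below uses hypothetical field names to illustrate that shape and how a downstream consumer might walk it:

```python
# Illustrative layout-parsing result; the actual dots.ocr schema may use
# different field names, but the description implies per-element bounding
# boxes, categories, and content, listed in reading order.
result = [
    {"bbox": [72, 90, 523, 130], "category": "Text",
     "text": "Quarterly revenue grew across all regions."},
    {"bbox": [72, 150, 523, 310], "category": "Table",
     "text": "<table><tr><td>Region</td><td>Revenue</td></tr></table>"},
    {"bbox": [72, 330, 523, 370], "category": "Formula",
     "text": r"\frac{\partial L}{\partial w} = x^\top (\hat{y} - y)"},
]

# Because reading order is preserved, rendering to Markdown is a simple walk:
for block in result:
    if block["category"] == "Formula":
        print(f"$${block['text']}$$")   # formulas arrive as LaTeX
    else:
        print(block["text"])            # text and tables pass through
```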
In head-to-head evaluations against contemporary document AI systems, dots.ocr has demonstrated impressive performance. In table parsing, it achieved a TEDS (Tree Edit Distance-based Similarity) score of 88.6%, surpassing Gemini2.5-Pro’s 85.8%. In text extraction, it recorded an edit distance of 0.032 versus Gemini2.5-Pro’s 0.055; since edit distance counts character-level errors, lower is better. The model also matches or exceeds leading competitors on the complex tasks of formula recognition and overall document structure reconstruction.
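The benchmark’s exact normalization is not stated, but text edit distance is conventionally Levenshtein distance scaled by the reference length, so a score of 0.032 means roughly three character-level errors per hundred characters. A quick illustration of that formulation (the normalization choice here is an assumption):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled by reference length; 0.0 is a perfect match."""
    return levenshtein(pred, ref) / max(len(ref), 1)

print(normalized_edit_distance("dots.ocr", "dots ocr"))  # 0.125
```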
Adding to its appeal, dots.ocr is released under the permissive MIT license, making it freely available as an open-source project. Its source code, comprehensive documentation, and pre-trained models are readily accessible on GitHub, facilitating easy adoption and integration. Developers can deploy the model using standard package managers like pip or Conda, or leverage Docker for containerized environments. The model supports flexible task configuration through prompt templates, enabling both interactive use and integration into automated pipelines for batch document processing. The extracted results are provided in structured JSON for programmatic use, with options for Markdown and HTML where appropriate, complemented by visualization scripts to inspect detected layouts.
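The actual prompt strings and inference entry points live in the GitHub repository; the sketch below only illustrates the prompt-switching design the article describes. The prompt texts, the parse_document helper, and the directory layout are all hypothetical stand-ins for the repo’s real API:

```python
import json
import pathlib

# Hypothetical prompt templates: the real strings ship with the dots.ocr
# repo; these stand-ins only illustrate the "one model, many tasks" design.
PROMPTS = {
    "layout_all": "Parse all layout elements and their content.",
    "ocr_only": "Extract all text in reading order.",
    "layout_only": "Detect layout elements and return bounding boxes.",
}

def parse_document(image_path: str, task: str = "layout_all") -> list[dict]:
    """Placeholder for the model call; the repo's inference API would take
    the image plus PROMPTS[task] and return the structured result."""
    prompt = PROMPTS[task]
    # ... run dots.ocr inference with `prompt` on `image_path` here ...
    return []  # stub result

# Batch pipeline: switching the task means switching the prompt, nothing else.
for page in sorted(pathlib.Path("pages").glob("*.png")):
    blocks = parse_document(str(page), task="layout_all")
    page.with_suffix(".json").write_text(json.dumps(blocks, indent=2))
```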
In summary, dots.ocr presents a powerful and accessible technical solution for high-accuracy, multilingual document parsing. By unifying layout detection and content recognition within a single, open-source framework, it offers a robust, language-agnostic tool particularly well-suited for information extraction in diverse production environments, even those with limited computational resources.