Transforming documents into LLM-ready structured data
Today, we are proud to announce the release of Nanonets-OCR2, a state-of-the-art suite of models designed for advanced image-to-markdown conversion and Visual Question Answering (VQA).
Nanonets-OCR2 not only converts documents into structured markdown but also leverages intelligent content recognition, semantic tagging, and context-aware visual question answering, enabling deeper understanding and more accurate interpretation of complex documents.
This model builds upon the capabilities of our previous release, Nanonets-OCR-s, offering significant
enhancements in document understanding and content differentiation. The improved model can accurately
distinguish between standard content and specialized elements such as watermarks, signatures, headers,
footers, checkboxes and page numbers. It has been trained to provide more descriptive interpretations of
visual elements within documents, while also delivering improved performance on complex structures
including tables, checkboxes, and equations. Additionally, the new models are capable of generating
Mermaid code for flowcharts and organizational charts, enabling seamless visualization of structured
information.
The model has also been specifically trained for Visual Question Answering (VQA)
focused on context-driven information extraction. When the requested information is not present in the
document, the model is designed to return "Not mentioned". This targeted training approach reduces
hallucinations compared to models trained on generic VQA tasks, resulting in more accurate and reliable
answers.
1. LaTeX equation recognition
Automatically converts mathematical equations and formulas into properly formatted LaTeX syntax. Inline mathematical expressions are converted to LaTeX inline equations, while displayed equations are converted to LaTeX display equations. Page numbers are predicted within the <page_number> tag.
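To illustrate the format, a made-up snippet (not output from a real document) might look like this:

```
The discriminant $\Delta = b^2 - 4ac$ determines the nature of the roots, which are given by

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

<page_number>4</page_number>
```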
2. Image description
Describes images within documents using structured tags, making them digestible for LLM processing. If a figure caption is present, the model uses it as the description; otherwise, it generates one. The model can describe single or multiple images (logos, charts, graphs, QR codes, etc.) in terms of their content, style, and context. The model predicts the image description within the <img> tag.
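A hypothetical example of the tag format (the chart and its description are invented for illustration):

```
<img>Bar chart comparing quarterly revenue for FY2023 and FY2024; the y-axis is labelled "Revenue (USD millions)" and the FY2024 bars are consistently higher.</img>
```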

3. Signature detection
Identifies and isolates signatures from other text in documents, which is crucial for legal and business document processing. The model predicts the signature text within the <signature> tag. If the signature is not readable, the model returns <signature>signature</signature> to mark the document as signed.
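For example (the name is illustrative):

```
Authorized by:
<signature>John A. Smith</signature>
```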

4. Watermark extraction
Similar to signature detection, the model can detect and extract watermark text from documents. The model predicts the watermark text within the <watermark> tag. It also performs well on low-quality images, as shown below.
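The tag format looks like this (the watermark text is illustrative):

```
<watermark>CONFIDENTIAL - DO NOT DISTRIBUTE</watermark>
```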
5. Checkbox handling
Converts form checkboxes and radio buttons into standardized Unicode symbols for consistent processing. The model predicts the checkbox status within the <checkbox> tag.
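An illustrative example, assuming ☑ (checked) and ☐ (unchecked) as the standardized symbols:

```
<checkbox>☑</checkbox> I agree to the terms and conditions
<checkbox>☐</checkbox> Subscribe to the newsletter
```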
6. Table extraction
Extracts complex tables from documents and converts them into markdown and HTML tables.
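For instance, a simple line-item table might come out as markdown like this (the values are made up):

```
| Item     | Qty | Unit price | Total  |
|----------|-----|------------|--------|
| Widget A | 2   | $10.00     | $20.00 |
| Widget B | 1   | $35.50     | $35.50 |
```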
7. Flow chart & organizational chart
The model extracts Mermaid code for flowcharts and organizational charts.
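As an illustration, a simple approval flowchart could be emitted as Mermaid code along these lines (the chart itself is hypothetical):

```mermaid
flowchart TD
    A[Receive invoice] --> B{"Amount > $10,000?"}
    B -- Yes --> C[Manager approval]
    B -- No --> D[Auto-approve]
    C --> E[Schedule payment]
    D --> E
```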
8. Multilingual documents
The model is trained on documents in multiple languages, including English, Chinese, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Arabic, and many more.
9. Visual Question Answering (VQA)
The model is designed to provide the answer directly if it is present in the document; otherwise, it responds with "Not mentioned."
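A hypothetical exchange over an invoice page might look like:

```
Q: What is the invoice due date?
A: 15 March 2024

Q: What is the customer's fax number?
A: Not mentioned
```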
Comparison of checkbox detection and extraction between Nanonets OCR 2 and dots.ocr.
Comparison of flowchart extraction between Nanonets OCR 2 and dots.ocr.
Comparison of image description generation between Nanonets OCR 2 and dots.ocr.
Comparison of signature detection between Nanonets OCR 2 and dots.ocr.
Comparison of table extraction between Nanonets OCR 2 and dots.ocr.
Comparison of watermark handling between Nanonets OCR 2 and dots.ocr.
We used Gemini-2.5-Pro as the judge model to evaluate the markdown outputs generated by the two models in each comparison. Although existing benchmarks such as olmOCRbench and OmniDocBench are available, they have notable limitations when it comes to assessing image-to-markdown performance, which we'll explore in detail in a separate post. We plan to open-source the evaluation code and model predictions on our GitHub repository.
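As a rough sketch of what such pairwise judging can look like (the exact prompt, scoring scheme, and aggregation we used are not reproduced here; the snippet below is an illustration using the google-genai client):

```python
# Hypothetical pairwise LLM-as-judge sketch; the real evaluation prompt and
# aggregation behind the reported win/lose rates may differ.
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are given a document image and two markdown transcriptions, A and B. "
    "Decide which one more faithfully captures the document's text, tables, "
    "equations, and structure. Answer with exactly one word: A, B, or TIE."
)

def judge_page(image_bytes: bytes, markdown_a: str, markdown_b: str) -> str:
    """Return 'A', 'B', or 'TIE' for a single page comparison."""
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
            JUDGE_PROMPT,
            f"Transcription A:\n{markdown_a}",
            f"Transcription B:\n{markdown_b}",
        ],
    )
    return response.text.strip().upper()
```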
[Results tables: win and lose rates (%) of compared models, including Gemini 2.5 Flash (no thinking), against Nanonets OCR 2 and Nanonets OCR 2 3B.]
We have used the IDP Leaderboard's VQA datasets to evaluate these models.
To train our new Visual-Language Model (VLM) for high-precision optical character recognition (OCR), we
assembled a dataset of over 3 million pages. This dataset encompasses a wide range of document types,
including research papers, financial reports, legal contracts, healthcare records, tax forms, receipts, and
invoices. It also includes documents featuring embedded images, plots, equations, signatures, watermarks,
checkboxes, and complex tables. Furthermore, we incorporated flowcharts, organizational charts, handwritten
materials, and multilingual documents to ensure comprehensive coverage of real-world document variations.
We
have used both synthetic and manually annotated datasets. We first trained the model on the synthetic
dataset and then fine-tuned it on the manually annotated dataset.
We selected the Qwen2.5-VL-3B model
as the base model for our Visual-Language Model (VLM). This model was subsequently fine-tuned on the curated
dataset to improve its performance on document-specific Optical Character Recognition (OCR) tasks.
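Since the model is fine-tuned from Qwen2.5-VL-3B, inference should follow the standard Qwen2.5-VL chat flow in transformers. The snippet below is a minimal sketch under that assumption; the Hugging Face model id, prompt wording, and generation settings are placeholders to adapt to the actual release.

```python
# Minimal inference sketch, assuming a Qwen2.5-VL-compatible checkpoint on Hugging Face.
# The model id and prompt are illustrative placeholders.
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "nanonets/Nanonets-OCR2-3B"  # assumption: replace with the released checkpoint name

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Load one document page as an image.
image = Image.open("page_1.png")

# Ask for markdown output that uses the semantic tags described above.
prompt = (
    "Extract the text from the document as markdown. Use <signature>, <watermark>, "
    "<checkbox>, <img>, and <page_number> tags where appropriate."
)
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": prompt},
    ],
}]

# Build the chat-formatted prompt, then process text and image together.
chat_text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[chat_text], images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=4096)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```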
Limitations:
• For complex flowcharts and organizational charts, the model might produce incorrect results.
• The model can suffer from hallucination.
Nanonets-OCR2 streamlines complex document workflows across industries by unlocking structured data from unstructured formats.
Digitizes papers with LaTeX equations and tables.
Accurately captures text and checkboxes from medical forms.
Transforms reports into searchable, image-aware knowledge bases.
In a world moving towards LLM-driven automation, unstructured data is the biggest bottleneck. Nanonets-OCR2 bridges that gap, transforming messy documents into the clean, structured, and context-rich markdown that modern AI applications demand.
We have integrated Nanonets-OCR2 with Docstrange, so feel free to try it out. If you have any questions, start a discussion on GitHub or Hugging Face.