Transforming documents into LLM-ready structured data
Today, we are proud to announce the release of Nanonets-OCR2, a state-of-the-art suite of models designed for advanced image-to-markdown conversion and Visual Question Answering (VQA).
Nanonets-OCR2 not only converts documents into structured markdown but also leverages intelligent content recognition, semantic tagging, and context-aware visual question answering, enabling deeper understanding and more accurate interpretation of complex documents.
This model builds upon the capabilities of our previous release, Nanonets-OCR-s, offering significant
enhancements in document understanding and content differentiation. The improved model can accurately
distinguish between standard content and specialized elements such as watermarks, signatures, headers,
footers, checkboxes and page numbers. It has been trained to provide more descriptive interpretations of
visual elements within documents, while also delivering improved performance on complex structures
including tables, checkboxes, and equations. Additionally, the new models are capable of generating
Mermaid code for flowcharts and organizational charts, enabling seamless visualization of structured
information.
The model has also been specifically trained for Visual Question Answering (VQA)
focused on context-driven information extraction. When the requested information is not present in the
document, the model is designed to return "Not mentioned". This targeted training approach reduces
hallucinations compared to models trained on generic VQA tasks, resulting in more accurate and reliable
answers.
1. LaTeX equation recognition
Automatically converts mathematical equations and formulas into properly formatted LaTeX syntax. Inline mathematical expressions are converted to LaTeX inline equations, while displayed equations are converted to LaTeX display equations. Page numbers are predicted within the <page_number> tag.
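To illustrate the format, a made-up snippet (not output from a real document) might look like this:

```
The discriminant $\Delta = b^2 - 4ac$ determines the nature of the roots, which are given by

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

<page_number>4</page_number>
```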
2. Image description
Describes images within documents using structured tags, making them digestible for LLM processing. If a figure caption is present, the model uses it as the description; otherwise, it generates one. The model can describe single or multiple images (logos, charts, graphs, QR codes, etc.) in terms of their content, style, and context. The model predicts the image description within the <img> tag.
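A hypothetical example of the tag format (the chart and its description are invented for illustration):

```
<img>Bar chart comparing quarterly revenue for FY2023 and FY2024; the y-axis is labelled "Revenue (USD millions)" and the FY2024 bars are consistently higher.</img>
```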

3. Signature detection
Identifies and isolates signatures from other text in documents, which is crucial for legal and business document processing. The model predicts the signature text within the <signature> tag. If the signature is not readable, the model returns <signature>signature</signature> to mark the document as signed.
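For example (the name is illustrative):

```
Authorized by:
<signature>John A. Smith</signature>
```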

4. Watermark extraction
Similar to signature detection, the model can detect and extract watermark text from documents. The model predicts the watermark text within the <watermark> tag. It also performs well on low-quality images, as shown below.
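The tag format looks like this (the watermark text is illustrative):

```
<watermark>CONFIDENTIAL - DO NOT DISTRIBUTE</watermark>
```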
5. Checkbox handling
Converts form checkboxes and radio buttons into standardized Unicode symbols for consistent processing. The model predicts the checkbox status within the <checkbox> tag.
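An illustrative example, assuming ☑ (checked) and ☐ (unchecked) as the standardized symbols:

```
<checkbox>☑</checkbox> I agree to the terms and conditions
<checkbox>☐</checkbox> Subscribe to the newsletter
```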
6. Table extraction
Extracts complex tables from documents and converts them into markdown and HTML tables.
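For instance, a simple line-item table might come out as markdown like this (the values are made up):

```
| Item     | Qty | Unit price | Total  |
|----------|-----|------------|--------|
| Widget A | 2   | $10.00     | $20.00 |
| Widget B | 1   | $35.50     | $35.50 |
```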
7. Flow chart & organizational chart
The model extracts Mermaid code for flowcharts and organizational charts.
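As an illustration, a simple approval flowchart could be emitted as Mermaid code along these lines (the chart itself is hypothetical):

```mermaid
flowchart TD
    A[Receive invoice] --> B{"Amount > $10,000?"}
    B -- Yes --> C[Manager approval]
    B -- No --> D[Auto-approve]
    C --> E[Schedule payment]
    D --> E
```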
8. Multilingual documents
The model is trained on documents in multiple languages, including English, Chinese, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Arabic, and many more.
9. Visual Question Answering (VQA)
The model is designed to provide the answer directly if it is present in the document; otherwise, it responds with "Not mentioned."
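A hypothetical exchange over an invoice page might look like:

```
Q: What is the invoice due date?
A: 15 March 2024

Q: What is the customer's fax number?
A: Not mentioned
```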
Comparison of checkbox detection and extraction between Nanonets OCR 2 and dots.ocr.
Comparison of flowchart extraction between Nanonets OCR 2 and dots.ocr.
Comparison of image description generation between Nanonets OCR 2 and dots.ocr.
Comparison of signature detection between Nanonets OCR 2 and dots.ocr.
Comparison of table extraction between Nanonets OCR 2 and dots.ocr.
Comparison of watermark handling between Nanonets OCR 2 and dots.ocr.
We used Gemini-2.5-Pro as the judge model to evaluate the markdown outputs generated by the two models in each comparison. Although existing benchmarks such as olmOCRbench and OmniDocBench are available, they have notable limitations when it comes to assessing image-to-markdown performance, which we'll explore in detail in a separate post. We plan to open-source the evaluation code and model predictions on our GitHub repository.
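As a rough sketch of what such pairwise judging can look like (the exact prompt, scoring scheme, and aggregation we used are not reproduced here; the snippet below is an illustration using the google-genai client):

```python
# Hypothetical pairwise LLM-as-judge sketch; the real evaluation prompt and
# aggregation behind the reported win/lose rates may differ.
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are given a document image and two markdown transcriptions, A and B. "
    "Decide which one more faithfully captures the document's text, tables, "
    "equations, and structure. Answer with exactly one word: A, B, or TIE."
)

def judge_page(image_bytes: bytes, markdown_a: str, markdown_b: str) -> str:
    """Return 'A', 'B', or 'TIE' for a single page comparison."""
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
            JUDGE_PROMPT,
            f"Transcription A:\n{markdown_a}",
            f"Transcription B:\n{markdown_b}",
        ],
    )
    return response.text.strip().upper()
```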
[Results tables: win and lose rates (%) of compared models, including Gemini 2.5 Flash (no thinking), against Nanonets OCR 2 and Nanonets OCR 2 3B.]
We have used the IDP Leaderboard's VQA datasets to evaluate these models.
To train our new Visual-Language Model (VLM) for high-precision optical character recognition (OCR), we
assembled a dataset of over 3 million pages. This dataset encompasses a wide range of document types,
including research papers, financial reports, legal contracts, healthcare records, tax forms, receipts, and
invoices. It also includes documents featuring embedded images, plots, equations, signatures, watermarks,
checkboxes, and complex tables. Furthermore, we incorporated flowcharts, organizational charts, handwritten
materials, and multilingual documents to ensure comprehensive coverage of real-world document variations.
We
have used both synthetic and manually annotated datasets. We first trained the model on the synthetic
dataset and then fine-tuned it on the manually annotated dataset.
We selected the Qwen2.5-VL-3B model
as the base model for our Visual-Language Model (VLM). This model was subsequently fine-tuned on the curated
dataset to improve its performance on document-specific Optical Character Recognition (OCR) tasks.
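Since the model is fine-tuned from Qwen2.5-VL-3B, inference should follow the standard Qwen2.5-VL chat flow in transformers. The snippet below is a minimal sketch under that assumption; the Hugging Face model id, prompt wording, and generation settings are placeholders to adapt to the actual release.

```python
# Minimal inference sketch, assuming a Qwen2.5-VL-compatible checkpoint on Hugging Face.
# The model id and prompt are illustrative placeholders.
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "nanonets/Nanonets-OCR2-3B"  # assumption: replace with the released checkpoint name

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Load one document page as an image.
image = Image.open("page_1.png")

# Ask for markdown output that uses the semantic tags described above.
prompt = (
    "Extract the text from the document as markdown. Use <signature>, <watermark>, "
    "<checkbox>, <img>, and <page_number> tags where appropriate."
)
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": prompt},
    ],
}]

# Build the chat-formatted prompt, then process text and image together.
chat_text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[chat_text], images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=4096)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```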
Limitations:
• For complex flowcharts and organizational charts, the model might produce incorrect results.
• The model can suffer from hallucination.
Nanonets-OCR2 streamlines complex document workflows across industries by unlocking structured data from unstructured formats.
Digitizes papers with LaTeX equations and tables.
Accurately captures text and checkboxes from medical forms.
Transforms reports into searchable, image-aware knowledge bases.
In a world moving towards LLM-driven automation, unstructured data is the biggest bottleneck. Nanonets-OCR2 bridges that gap, transforming messy documents into the clean, structured, and context-rich markdown that modern AI applications demand.
We have integrated Nanonets-OCR2 with Docstrange, so feel free to try it out. If you have any questions, start a discussion on GitHub or Hugging Face.