93.1 on olmOCR benchmark. 90.5 on OmniDocBench. #1 on IDP Leaderboard.
Models compared: specialized VLMs (Nanonets OCR-3, Chandra OCR 2, LightOn OCR-2, Deepseek-OCR2, Mistral OCR 3) and general VLMs (Gemini 3.1 Pro, GPT-5.4).

| Benchmark | Nanonets OCR-3 | Chandra OCR 2 | LightOn OCR-2 | Deepseek-OCR2 | Mistral OCR 3 | Gemini 3.1 Pro | GPT-5.4 |
|---|---|---|---|---|---|---|---|
| OlmOCR Benchmark (overall) | 87.4 | 85.9 | 83.2 | 76.3 | 81.7 | 79.6 | 81.0 |
| ArXiv Math | 89.2 | 90.2 | 89.6 | 81.9 | 85.4 | 70.6 | 83.1 |
| H&F | 96.6 | 92.5 | – | 95.6 | 93.8 | – | – |
| Long/Tiny | 93.4 | 92.1 | 91.4 | 88.7 | 88.9 | 90.3 | 82.6 |
| Multi-Col | 87.6 | 83.5 | 84.8 | 83.6 | 82.1 | 79.2 | 83.7 |
| Old Scans | 49.6 | 49.8 | 42.2 | 33.7 | 48.8 | 47.5 | 43.9 |
| Scans Math | 88.9 | 89.3 | 85.6 | 68.8 | 68.3 | 84.9 | 82.3 |
| Tables | 94.2 | 89.9 | 89.0 | 78.1 | 86.1 | 84.9 | 91.1 |
| OmniDocBench (v1.5) | 90.5 | 85.5 | – | 87.7 | 85.3 | 85.3 | 85.3 |
LLM as a Judge: Corrected Scores
437 of 864 failed tests were identified as evaluator brittleness, not model errors. Weighted average: 94.9% (7,986 / 8,413 tests pass).
| | ArXiv Math | H&F | Long/Tiny | Multi-Col | Old Scans | Scans Math | Tables | Overall |
|---|---|---|---|---|---|---|---|---|
| Official | 89.2 | 96.6 | 93.4 | 87.6 | 49.6 | 88.9 | 94.2 | 87.4 |
| Corrected | 95.5 | 96.7 | 96.6 | 94.1 | 67.3 | 95.9 | 98.9 | 93.1 |
But we're more proud of these scores
While a lot of OCR models today are busy benchmaxxing on saturated benchmarks that don't translate into real-world use, Nanonets OCR-3 is specifically trained and tested on real-world documents.
- **FinanceBench: 94.5%.** Dense SEC 10-K filings averaging 143 pages, with nested tables, footnotes, and cross-references.
- **DocBench Legal: 96.0%.** Multi-column court filings and legislation with complex formatting, citations, and structural hierarchy.
If you have built a document pipeline, LLM-based or otherwise, you already know how brittle they are.
Nanonets OCR-3 is a Mixture-of-Experts model purpose-built to handle these pains. The model API exposes five endpoints to cover all use cases:
- **/parse** — Send a document, get back structured markdown: layout preserved, simple tables as markdown pipes, complex tables as HTML, reading order correct, and metadata with bounding boxes and confidence scores on every element. One call replaces your entire parsing stack.
- **/extract** — Pass a document and your schema, get back a schema-compliant, type-safe object with bounding boxes and confidence scores on every element. Works on invoices, forms, contracts, medical records, and any other document with a repeating structure.
- **/split** — Send a large PDF or multiple PDFs, get back documents split or classified by your own logic using document structure and content. Useful when your input is a batch scan or a mixed upload and you need to route each document type separately.
- **/chunk** — Splits a document into context-aware chunks optimized for RAG retrieval and inference. Unlike traditional chunking, it respects document structure: sections stay together, tables don't get cut in half, and so on.
- **/vqa** — Ask a question about a document, get a grounded answer with bounding boxes over the source regions. Useful for building UIs over documents and for sending precise regions to downstream LLMs without building a retrieval pipeline first.
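A minimal Python sketch of calling one of these endpoints. The base URL, payload shape, and response field below are placeholders for illustration, not the official SDK or documented schema:

```python
# Hypothetical client for the /parse endpoint. The base URL, payload
# shape, and response field are assumptions, not the documented API.
API_BASE = "https://api.example.com/v1"  # placeholder base URL

def build_parse_request(file_path: str, output: str = "markdown") -> dict:
    """Assemble a /parse request (assumed shape)."""
    return {
        "url": f"{API_BASE}/parse",
        "json": {"file": file_path, "output": output},
    }

# To actually send it (needs the `requests` package and an API key):
#   import requests
#   resp = requests.post(**build_parse_request("invoice.pdf"),
#                        headers={"Authorization": "Bearer <API_KEY>"})
#   markdown = resp.json()["markdown"]  # assumed response field

req = build_parse_request("invoice.pdf")
```

The other endpoints (/extract, /split, /chunk, /vqa) would follow the same pattern with their own payloads.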
Confidence scores, bounding boxes, VQA
Nanonets OCR-3 ships with three critical output features that most OCR models and document pipelines miss today.
1. Confidence scores
Every extraction comes with a confidence score, which lets you build pipelines that approach 100% accuracy: pass high-confidence outputs through automatically, route low-confidence outputs to human-in-the-loop (HIL) review or to larger models, and keep incorrect data from poisoning your production databases.
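A routing step like the following is all it takes; the thresholds and field shape here are illustrative, not the actual response schema:

```python
# Confidence-based routing sketch. Thresholds and the field shape are
# illustrative; the real response schema may differ.
AUTO_ACCEPT = 0.98   # above this, write straight to the database
HIL_FLOOR = 0.80     # below this, send to human-in-the-loop review

def route(field: dict) -> str:
    """Decide what happens to one extracted field from its confidence."""
    conf = field["confidence"]
    if conf >= AUTO_ACCEPT:
        return "accept"
    if conf >= HIL_FLOOR:
        return "retry_with_larger_model"
    return "human_review"

fields = [
    {"name": "invoice_total", "value": "1,204.50", "confidence": 0.995},
    {"name": "due_date", "value": "2024-07-01", "confidence": 0.91},
    {"name": "vendor_tax_id", "value": "12-34567??", "confidence": 0.42},
]
print([route(f) for f in fields])
# → ['accept', 'retry_with_larger_model', 'human_review']
```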
2. Bounding boxes
OCR-3 outputs spatial coordinates for every element. This enables you to highlight source locations in your UI, power citation trails in RAG pipelines, pass charts/images/sections exclusively to VLMs, and feed precise regions to document agents and downstream LLMs.
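A small sketch of using those coordinates to crop a region out of a rendered page, assuming an `[x, y, w, h]` page-coordinate box (the actual metadata format may differ):

```python
# Sketch: turning an element's bounding box into an image-pixel crop you
# can highlight in a UI or send to a VLM. The [x, y, w, h] page-coordinate
# format is an assumption about the metadata shape.
def bbox_to_crop(bbox, page_w, page_h, img_w, img_h):
    """Scale a page-space [x, y, w, h] box to image-pixel (l, t, r, b)."""
    x, y, w, h = bbox
    sx, sy = img_w / page_w, img_h / page_h
    return (round(x * sx), round(y * sy),
            round((x + w) * sx), round((y + h) * sy))

# A chart detected at [32, 420, 476, 120] on a 612x792 pt page,
# rendered to a 1224x1584 px image:
crop = bbox_to_crop([32, 420, 476, 120], 612, 792, 1224, 1584)
print(crop)  # → (64, 840, 1016, 1080)
```

The resulting `(left, top, right, bottom)` tuple can be passed directly to an image library's crop call.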
Give document agents surgical precision

For example, given the instruction "Verify the total revenue claim against the table", an agent receives labeled regions (Revenue KPI [32,172,140,48], Quarterly Data Table [32,420,476,120], Executive Summary [32,100,476,60]), locates the claim in the executive summary, sums the quarterly table ($12.4M + $16.1M + $18.7M + $24.2M = $71.4M), and confirms it matches the stated total.
3. Visual Question Answering
The model's API natively supports visual question answering. You can ask questions about a document and get grounded answers with supporting evidence from the page.
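Consuming a grounded answer might look like this; the response shape (answer text plus evidence regions) is an assumed format for illustration:

```python
# Sketch of consuming a grounded VQA answer. The response shape shown
# here (answer text plus evidence regions) is an assumed format.
def format_grounded_answer(resp: dict) -> str:
    """Render the answer with page/bounding-box citations appended."""
    cites = ", ".join(f"p{r['page']} @ {r['bbox']}" for r in resp["evidence"])
    return f"{resp['answer']} [sources: {cites}]"

resp = {
    "answer": "Total revenue was $71.4M.",
    "evidence": [{"page": 3, "bbox": [32, 420, 476, 120]}],
}
print(format_grounded_answer(resp))
# → Total revenue was $71.4M. [sources: p3 @ [32, 420, 476, 120]]
```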
Fine-tuned on edge cases
We’ve been working in this space for the last 7 years and have repeatedly seen the same edge cases where OCR fails. Nanonets OCR-3 is extensively fine-tuned on these edge cases:
OCR-3 parses simple tables in markdown and complex tables in HTML. It preserves colspan / rowspan in merged cells, does not flatten nested tables, preserves indentation level as metadata, and retains structure perfectly on sparse tables.
Trained and tested to perform context-aware parsing on complex documents ensuring accurate layout extraction and reading order.
Extensive fine-tuning runs on W-2, W-4, 1040, etc. to ensure 99%+ extraction accuracy on forms.
The model API is integrated with OCR engines for deterministic, character-level accuracy on complex numbers and dates where pure VLMs are prone to hallucinations.
NanoIndex: A New Vectorless RAG Framework
Bad document ingestion is the #1 reason RAG pipelines fail. NanoIndex fixes this: vectorless, context-aware, layout-aware, with pixel-level precision.
How it works
1. Run Nanonets OCR-3 on your documents.
2. OCR-3 extracts structured markdown, hierarchy, tables, and bounding boxes.
3. A deterministic tree builder turns that into a navigable tree. Zero LLM calls.
4. Your downstream LLM navigates the tree and returns answers with citations and page numbers.
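The deterministic tree-builder step can be sketched in a few lines. This toy version only handles markdown headings, whereas the real builder also carries tables, bounding boxes, and page numbers:

```python
# Toy sketch of the deterministic tree-builder step: turn markdown
# headings into a navigable tree with zero LLM calls.
import re

def build_tree(markdown: str) -> dict:
    """Build a nested section tree from markdown heading levels."""
    root = {"title": "ROOT", "level": 0, "children": [], "text": []}
    stack = [root]
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            node = {"title": m.group(2), "level": len(m.group(1)),
                    "children": [], "text": []}
            # Pop back up to the nearest shallower heading.
            while stack[-1]["level"] >= node["level"]:
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            stack[-1]["text"].append(line)
    return root

doc = "# Report\n## Revenue\nQ1 was strong.\n## Costs\nFlat YoY."
tree = build_tree(doc)
print([c["title"] for c in tree["children"][0]["children"]])
# → ['Revenue', 'Costs']
```

The downstream LLM then walks this tree section by section instead of searching an embedding index.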
Why OCR-3
The pipeline depends on extraction quality. OCR-3 provides in a single API call: structured markdown with headings preserved, table-of-contents and hierarchy detection, tables as structured data, bounding boxes per element for page-level citations, and layout understanding that distinguishes headers from body text.
NanoIndex in action. Full source code and examples dropping next week.
Devlog
Compared to v2, our new model is bigger (35B parameters), yet it is faster and cheaper.
Here’s how we achieved it:
1. Token Efficiency
Training on full-resolution images the whole time is expensive and, it turns out, unnecessary. We trained 75% of the time on low-resolution images and 25% on full resolution. The model performs the same as training on 100% full resolution.
Most of what a document model needs to learn is structural. Where are the tables? What’s a header versus body text? How do columns relate? Low-resolution images teach this fine.
High resolution matters for character-level detail, but you don’t need to see every character in crisp detail to learn layout.
At inference, the model caps token usage at 1280 tokens per image. This results in predictable latency and costs without degradation in accuracy.
2. MoE architecture
The new architecture is a Mixture-of-Experts model.
A standard dense transformer runs every parameter on every token. An MoE model activates only the relevant experts: two or three sub-networks out of many.
For us, this meant 2x faster inference compared to the previous dense model at equivalent quality. The experts specialized on their own and we didn’t need to design that in.
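A toy illustration of top-k expert routing; the experts and gate scores below are made up, and in the real model both are learned:

```python
# Toy sketch of top-k MoE routing: a gate scores every expert, but only
# the top-k actually run for a given token.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, experts, gate_scores, k=2):
    """Run only the k highest-scoring experts, renormalizing their weights."""
    probs = softmax(gate_scores)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    z = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return sum(probs[i] / z * experts[i](token) for i in top)

# Four toy scalar "experts"; only two run per token.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x - 3, lambda x: x / 2]
out = moe_forward(10.0, experts, gate_scores=[0.1, 3.0, 0.2, 2.0], k=2)
print(round(out, 3))  # → 15.966
```

Because only two of the four experts execute, compute per token is roughly halved even though total parameter count grows.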
3. Preventing catastrophic forgetting
We learned first-hand that catastrophic forgetting is real: fine-tune too hard on one domain and the model forgets everything else.
We used frozen backbone layers, EWC (elastic weight consolidation) regularization, replay buffers (15% of each batch was non-document data), and alternating OCR/general-knowledge training phases.
The model still handles general tasks accurately, which matters for VQA, agents, etc.
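The replay-buffer mix mentioned above can be sketched as follows; the 15% figure is from the recipe, everything else is illustrative:

```python
# Sketch of the replay-buffer batch mix: roughly 15% of each training
# batch is general (non-document) data to fight catastrophic forgetting.
import random

def mixed_batch(doc_pool, general_pool, batch_size=32, replay_frac=0.15,
                rng=random):
    """Sample a batch that is (1 - replay_frac) documents, rest replay data."""
    n_general = max(1, round(batch_size * replay_frac))
    batch = (rng.sample(doc_pool, batch_size - n_general)
             + rng.sample(general_pool, n_general))
    rng.shuffle(batch)  # interleave so the mix isn't ordered
    return batch

docs = [f"doc_{i}" for i in range(1000)]
general = [f"gen_{i}" for i in range(1000)]
batch = mixed_batch(docs, general, batch_size=32)
n_gen = sum(1 for x in batch if x.startswith("gen_"))
print(n_gen)  # → 5  (≈15% of 32)
```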
Lastly, we trained the model on over 11 million documents in just under a month, and to do this, we used optimization techniques like:
Gradient checkpointing - instead of storing all intermediate activations in memory during the forward pass, we recomputed them during backprop. Compute got slower, but memory use fell by roughly 60–70%, which let us fit larger batch sizes on the same hardware.
Mixed precision training - forward and backward passes in FP16, weight updates in FP32. This means half the memory bandwidth for most of the work, with FP32 precision where it matters (such as for numerical stability).
Distributed training - model and data parallelism across GPU clusters.
Learning rate scheduling - warmup for the first few thousand steps, then cosine decay. The warmup prevents the early instability you get when gradients are large and the model is far from any sensible solution. Cosine decay lets the model settle into a good minimum rather than oscillating.
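That warmup-plus-cosine schedule can be sketched as follows; the step counts and base learning rate are illustrative:

```python
# Linear warmup followed by cosine decay to zero. Step counts and
# base_lr are illustrative, not the actual training configuration.
import math

def lr_at(step, base_lr=3e-4, warmup_steps=2000, total_steps=100_000):
    """Learning rate at a given step under warmup + cosine decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

print(lr_at(1000))     # halfway through warmup → 1.5e-4
print(lr_at(2000))     # warmup done → 3e-4 (cosine starts at its peak)
print(lr_at(100_000))  # end of training → 0.0
```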
None of this is novel, but getting it all tuned correctly for our specific architecture took almost a week.
We'll release a full-length technical blog on our methodology soon. Stay tuned if you're interested.