Nanonets OCR-3 is #1 on global benchmarks

93.1 on olmOCR benchmark. 90.5 on OmniDocBench. #1 on IDP Leaderboard.

Specialized VLMs: Nanonets OCR-3, Chandra OCR 2, LightOn OCR-2, Deepseek-OCR2, Mistral OCR 3. General VLMs: Gemini 3.1 Pro, GPT-5.4.

| Benchmark | Nanonets OCR-3 | Chandra OCR 2 | LightOn OCR-2 | Deepseek-OCR2 | Mistral OCR 3 | Gemini 3.1 Pro | GPT-5.4 |
|---|---|---|---|---|---|---|---|
| OlmOCR Benchmark | 87.4 | 85.9 | 83.2 | 76.3 | 81.7 | 79.6 | 81.0 |
| ArXiv Math | 89.2 | 90.2 | 89.6 | 81.9 | 85.4 | 70.6 | 83.1 |
| H&F | 96.6 | 92.5 | 95.6 | 93.8 | | | |
| Long/Tiny | 93.4 | 92.1 | 91.4 | 88.7 | 88.9 | 90.3 | 82.6 |
| Multi-Col | 87.6 | 83.5 | 84.8 | 83.6 | 82.1 | 79.2 | 83.7 |
| Old Scans | 49.6 | 49.8 | 42.2 | 33.7 | 48.8 | 47.5 | 43.9 |
| Scans Math | 88.9 | 89.3 | 85.6 | 68.8 | 68.3 | 84.9 | 82.3 |
| Tables | 94.2 | 89.9 | 89.0 | 78.1 | 86.1 | 84.9 | 91.1 |
| OmniDocBench (v1.5) | 90.5 | 85.5 | 87.7 | 85.3 | 85.3 | 85.3 | |

LLM as a Judge: Corrected Scores

437 of 864 failed tests were identified as evaluator brittleness, not model errors. Weighted average: 94.9% (7,986 / 8,413 tests pass).

| | ArXiv Math | H&F | Long/Tiny | Multi-Col | Old Scans | Scans Math | Tables | Overall |
|---|---|---|---|---|---|---|---|---|
| Official | 89.2 | 96.6 | 93.4 | 87.6 | 49.6 | 88.9 | 94.2 | 87.4 |
| Corrected | 95.5 | 96.7 | 96.6 | 94.1 | 67.3 | 95.9 | 98.9 | 93.1 |

But we're more proud of these scores

While a lot of OCR models today are busy benchmaxxing on saturated benchmarks that don't translate into real-world use, Nanonets OCR-3 is specifically trained and tested on real-world documents.

  1. FinanceBench: 94.5%. Dense SEC 10-K filings averaging 143 pages, with nested tables, footnotes, and cross-references.
  2. DocBench Legal: 96.0%. Multi-column court filings and legislation with complex formatting, citations, and structural hierarchy.
  3. HealthcareBench: 90.1%. Clinical notes, discharge summaries, lab reports, insurance EOBs, and prior authorization forms.

The only document model you’ll ever need

If you have built a document pipeline, LLM-based or otherwise, you already know how brittle it is.

Nanonets OCR-3 is a Mixture-of-Experts model purpose-built to handle these pains. The model API exposes five endpoints to cover all use cases:

  1. /parse — Send a document, get back structured markdown. Layout preserved, simple tables as markdown pipes, complex tables as HTML, reading order correct, metadata with bounding boxes and confidence scores on every element. One call replaces your entire parsing stack.
  2. /extract — Pass a document and your schema. Get back a schema-compliant, type-safe object along with metadata with bounding boxes and confidence scores on every element. Works on invoices, forms, contracts, medical records, and any other document with a repeating structure.
  3. /split — Send a large PDF or multiple PDFs, get back split or classified documents based on your own logic using document structure and content. Useful when your input is a batch scan or a mixed upload and you need to route each document type separately.
  4. /chunk — Splits a document into context-aware chunks optimized for RAG retrieval and inference. Unlike traditional chunking methods, it respects document structure as sections stay together, tables don't get cut in half, etc.
  5. /vqa — Ask a question about a document, get a grounded answer with bounding boxes over the source regions. Useful for building UIs over documents and sending precise regions to downstream LLMs without needing to build a retrieval pipeline first.
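A client-side sketch of what consuming /parse output could look like. The response field names used here (`elements`, `markdown`, `bbox`, `confidence`) are assumptions for illustration, not the documented schema:

```python
def handle_parse_response(resp: dict) -> list[dict]:
    """Flatten a hypothetical /parse response into (text, bbox, confidence) records."""
    records = []
    for element in resp.get("elements", []):
        records.append({
            "text": element["markdown"],
            "bbox": element["bbox"],              # [x, y, width, height]
            "confidence": element["confidence"],  # 0.0 - 1.0
        })
    return records

# Example response shape (invented for illustration):
sample = {
    "markdown": "# Title\n\n| a | b |\n|---|---|\n| 1 | 2 |",
    "elements": [
        {"markdown": "# Title", "bbox": [32, 40, 476, 24], "confidence": 0.98},
        {"markdown": "| a | b | ...", "bbox": [32, 90, 476, 60], "confidence": 0.91},
    ],
}
for rec in handle_parse_response(sample):
    print(rec["confidence"], rec["bbox"])
```

The same per-element confidence and bounding-box metadata is what the routing and citation patterns below build on.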

Confidence scores, bounding boxes, VQA

Nanonets OCR-3 ships with three critical output features that most OCR models and document pipelines miss today.

1. Confidence scores

Every extraction comes with confidence scores, which lets you build pipelines that approach 100% accuracy: pass high-confidence outputs through directly, route low-confidence outputs to human-in-the-loop (HIL) review or larger models, and keep incorrect data out of your production databases.

Scanned documents go through VLM extraction with confidence scores, then route by confidence:

  1. Above 90%: direct pass to production databases or downstream tasks.
  2. 60–90%: re-extract the document with Gemini 3.1 Pro; if the two outputs match, pass; if not, send to human review.
  3. Below 60%: human review; on pass, results flow to production databases or downstream tasks.
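The routing logic is simple enough to sketch in a few lines. Thresholds follow the flow above; the re-extraction comparison is stubbed out as a boolean:

```python
# Confidence-based routing sketch. The second-model comparison result is
# passed in as a boolean rather than actually calling another model here.
def route(confidence, reextracted_matches=None):
    """Decide where an extraction goes based on its confidence score."""
    if confidence > 0.90:
        return "production"             # direct pass
    if confidence >= 0.60:
        # Re-extract with a second model (e.g. Gemini 3.1 Pro) and compare.
        if reextracted_matches:
            return "production"
        return "human_review"
    return "human_review"               # below 60%: a human always looks

print(route(0.95))                              # production
print(route(0.75, reextracted_matches=True))    # production
print(route(0.75, reextracted_matches=False))   # human_review
print(route(0.40))                              # human_review
```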

2. Bounding boxes

OCR-3 outputs spatial coordinates for every element. This enables you to highlight source locations in your UI, power citation trails in RAG pipelines, pass charts/images/sections exclusively to VLMs, and feed precise regions to document agents and downstream LLMs.

Give document agents surgical precision

Sample document: "FY2026 Annual Revenue Report" (Vantage Systems, March 2026) with an executive summary, KPI cards ($71.4M revenue, 34.2% growth, 68.2% gross margin), a quarterly revenue chart, a segment-by-quarter revenue table, and audit footnotes.
Agent task: Verify the total revenue claim against the table.

Regions:

  1. Revenue KPI: [32, 172, 140, 48]
  2. Quarterly Data Table: [32, 420, 476, 120]
  3. Executive Summary: [32, 100, 476, 60]

Execution:

  1. LOCATE: find the revenue claim in the executive summary at [32, 92, 476, 60].
  2. EXTRACT: claim is "$71.4M total revenue".
  3. CROSS-REF: sum the table's Q4 column: $10.4M + $7.8M + $6.0M = $24.2M.
  4. VERIFY: annual sum = $12.4M + $16.1M + $18.7M + $24.2M = $71.4M ✓ matches the claim.
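The cross-reference and verification steps reduce to arithmetic over the extracted table. A minimal sketch using the figures from the sample document:

```python
# Segment -> quarterly revenue in $M, as extracted from the sample report.
table = {
    "Enterprise": [5.2, 6.8, 8.1, 10.4],
    "Mid-Market": [4.1, 5.3, 6.0, 7.8],
    "SMB":        [3.1, 4.0, 4.6, 6.0],
}

q4_total = round(sum(rows[3] for rows in table.values()), 1)
quarterly = [round(sum(rows[q] for rows in table.values()), 1) for q in range(4)]
annual = round(sum(quarterly), 1)

claim = 71.4  # stated in the executive summary
print(q4_total)              # 24.2
print(quarterly)             # [12.4, 16.1, 18.7, 24.2]
print(annual == claim)       # True
```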

3. Visual Question Answering

The model's API natively supports visual question answering. You can ask questions about a document and get grounded answers with supporting evidence from the page.

VQA demo on the same FY2026 Annual Revenue Report sample: a query about the document returns a grounded answer with the supporting page regions highlighted.

Fine-tuned on edge cases

We’ve been working in this space for the last seven years, and we have repeatedly seen the same edge cases where OCR fails. Nanonets OCR-3 is extensively fine-tuned on them:

  1. OCR-3 parses simple tables in markdown and complex tables in HTML. It preserves colspan/rowspan in merged cells, does not flatten nested tables, preserves indentation level as metadata, and retains structure on sparse tables. (Complex table examples 1 and 2.)
  2. Trained and tested to perform context-aware parsing on complex documents, ensuring accurate layout extraction and reading order. (Reading-order layout example.)
  3. Extensive fine-tuning runs on W-2, W-4, 1040, and similar forms to ensure 99%+ extraction accuracy.
  4. The model API is integrated with OCR engines for deterministic, character-level accuracy on complex numbers and dates, where pure VLMs are prone to hallucinations.
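One way such a VLM/OCR cross-check on numeric fields could work. This is a sketch of the idea, not the actual integration:

```python
import re

def normalize_number(s: str) -> str:
    """Keep only digits, sign, and decimal point for comparison."""
    return re.sub(r"[^\d.\-]", "", s)

def cross_check(vlm_value: str, ocr_value: str) -> bool:
    """True when the VLM's reading agrees with the deterministic OCR engine."""
    return normalize_number(vlm_value) == normalize_number(ocr_value)

print(cross_check("$71.4M", "71.4"))     # True: same digits, pass through
print(cross_check("$71.4M", "$714M"))    # False: digit mismatch, flag for review
```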

NanoIndex: A New Vectorless RAG Framework

Bad document ingestion is the #1 reason RAG pipelines fail. NanoIndex fixes this: vectorless, context-aware, layout-aware, with pixel-level precision.

How it works

  1. Run Nanonets OCR-3 on your documents.
  2. OCR-3 extracts structured markdown, hierarchy, tables, and bounding boxes.
  3. A deterministic tree builder turns that into a navigable tree. Zero LLM calls.
  4. Your downstream LLM navigates the tree, returns answers with citations and page numbers.
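Step 3's deterministic tree builder might look like the following sketch, assuming the parsed elements arrive as (heading level, text) pairs in reading order; all names here are illustrative:

```python
# Deterministic tree builder: headings nest by level, body text (level 0)
# attaches to the current section. No LLM calls involved.
def build_tree(elements: list[tuple[int, str]]) -> dict:
    root = {"title": "ROOT", "level": 0, "children": []}
    stack = [root]
    for level, text in elements:
        node = {"title": text, "level": level, "children": []}
        if level == 0:                       # body text joins the open section
            stack[-1]["children"].append(node)
            continue
        while stack[-1]["level"] >= level:   # pop back up to the parent heading
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root

doc = [(1, "Annual Report"), (2, "Executive Summary"), (0, "Revenue grew 34%..."),
       (2, "Quarterly Revenue"), (0, "Q4 was $24.2M...")]
tree = build_tree(doc)
print([c["title"] for c in tree["children"][0]["children"]])
```

The downstream LLM then navigates this tree section by section instead of searching an embedding index.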

Why OCR-3

The pipeline depends on extraction quality. OCR-3 provides in a single API call: structured markdown with headings preserved, table-of-contents and hierarchy detection, tables as structured data, bounding boxes per element for page-level citations, and layout understanding that distinguishes headers from body text.

NanoIndex document ingestion demo

NanoIndex in action. Full source code and examples dropping next week.

Devlog

Compared to v2, our new model is bigger (35B parameters), yet it is faster and cheaper.

Here’s how we achieved it:

1. Token Efficiency

Training on full-resolution images the whole time is expensive and, it turns out, unnecessary. We trained 75% of the time on low-resolution images and 25% on full resolution; the model performs the same as one trained entirely on full resolution.

Most of what a document model needs to learn is structural. Where are the tables? What’s a header versus body text? How do columns relate? Low-resolution images teach this fine.

High resolution matters for character-level detail, but you don’t need to see every character in crisp detail to learn layout.

At inference, the model caps token usage at 1280 tokens per image. This results in predictable latency and costs without degradation in accuracy.

2. MoE architecture

The new architecture is a Mixture-of-Experts model.

A standard transformer runs every parameter on every token. MoE activates only the relevant experts - 2 or 3 sub-networks out of many.

For us, this meant 2x faster inference compared to the previous dense model at equivalent quality. The experts specialized on their own and we didn’t need to design that in.
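A toy sketch of the routing idea: a gate scores each expert per token, and only the top-k experts run. The expert count, gate scores, and scalar "experts" below are invented for illustration:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, gate_scores, k=2):
    """Run only the top-k experts and mix their outputs by gate weight."""
    top = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:k]
    weights = softmax([gate_scores[i] for i in top])
    return sum(w * experts[i](token) for w, i in zip(weights, top))

# Eight tiny "experts", each just a scalar function for illustration.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate = [0.1, 2.0, 0.3, 1.5, 0.0, 0.2, 0.4, 0.1]   # gate favors experts 1 and 3
out = moe_forward(10.0, experts, gate, k=2)
print(out)   # a blend of expert 1 (2x) and expert 3 (4x), never the other six
```

Only 2 of the 8 expert functions ever execute, which is where the inference speedup comes from.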

3. Preventing catastrophic forgetting

We learned first-hand that this is a real phenomenon. Fine-tune too hard on one domain and the model forgets everything else.

We used frozen backbone layers, EWC regularization, replay buffers (15% of each batch was non-document data), and alternating OCR/general-knowledge training phases.

The model still handles general tasks accurately, which matters for VQA, agents, etc.
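Of these techniques, EWC is the easiest to show in miniature: it adds a quadratic penalty that anchors each weight in proportion to how important it was to earlier tasks (its Fisher value). The numbers below are made up:

```python
# EWC penalty: (lambda / 2) * sum_i F_i * (theta_i - theta_old_i)^2
def ewc_penalty(theta, theta_old, fisher, lam=10.0):
    return 0.5 * lam * sum(
        f * (t - t0) ** 2 for f, t, t0 in zip(fisher, theta, theta_old)
    )

theta_old = [1.0, -0.5, 2.0]   # weights after the general-knowledge phase
fisher    = [5.0,  0.1, 0.0]   # importance of each weight to that phase
theta     = [1.2, -0.9, 3.0]   # weights drifting during OCR fine-tuning

# The important weight (F=5.0) dominates the penalty even though it moved
# least; the unimportant one (F=0.0) can drift freely at no cost.
print(ewc_penalty(theta, theta_old, fisher))
```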

Lastly, we trained the model on over 11 million documents in just under a month, and to do this, we used optimization techniques like:

  1. Gradient checkpointing - instead of storing all intermediate activations in memory during the forward pass, we recomputed them during backprop. It was slower compute, but it cut memory by roughly 60–70%. This let us fit larger batch sizes on the same hardware.
  2. Mixed precision training - forward and backward passes in FP16, weight updates in FP32. This halves memory bandwidth for most of the work while keeping FP32 precision where it matters, such as numerical stability.
  3. Distributed training - model and data parallelism across GPU clusters.
  4. Learning rate scheduling - warmup for the first few thousand steps, then cosine decay. The warmup prevents the early instability you get when gradients are large and the model is far from any sensible solution. Cosine decay lets the model settle into a good minimum rather than oscillating.
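The schedule in item 4 can be sketched directly; the step counts and peak learning rate below are illustrative, not our actual hyperparameters:

```python
import math

def lr_at(step, warmup_steps=2000, total_steps=100_000, peak_lr=3e-4):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:                       # linear warmup
        return peak_lr * step / warmup_steps
    # cosine decay from peak to 0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(0))         # 0.0: start from nothing
print(lr_at(1000))      # halfway through warmup: 1.5e-4
print(lr_at(2000))      # peak: 3e-4
print(lr_at(100_000))   # fully decayed: 0.0
```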

None of this is novel, but getting it tuned correctly for our specific architecture took almost a week.

We'll release a full-length technical blog on our methodology soon. Stay tuned if you're interested.

Get API Key → | Documentation