Introduction
Interest in the field of OCR document processing has grown significantly with back-to-back releases from new market entrants. The most recent examples are Mistral, which released an OCR model claimed to be cheaper and more accurate than established players, and Andrew Ng, who released an agentic document extraction product. However, many enterprises struggle to separate valid claims from exaggerated ones. With so many new releases, it can be difficult to identify solutions that truly meet production-level requirements.
Why Benchmarks Matter
Benchmarks provide a structured method to compare and evaluate solutions, helping enterprises filter out unsuitable options, identify tools aligned with their data and operational needs, and streamline validation by reducing the number of products to review. However, a valuable benchmark must align with your organization's real-world challenges. Key considerations include:
- Dataset Relevance: Does the benchmark dataset reflect the types of documents you handle, such as invoices, receipts, or contracts? Does it account for factors like language, format (scanned vs. digital PDFs), length, and real-world imperfections?
- Task Completeness: Does the benchmark evaluate all stages of your document extraction process? Does it align with your goals, whether extracting structured data, performing OCR, or enabling enterprise-wide search?
Limitations in Current Benchmarks
| Benchmark | Size (documents) | OCR | Key Information Extraction | Markdown Generation | Automation |
|---|---|---|---|---|---|
| CC-OCR | 7,058 | ✓ | ✓ | | |
| OCRBench | 1,000 | ✓ | ✓ | | |
| DocILE Test Set | 1,000 | | ✓ | | |
| BuDDIE | 1,665 | | ✓ | | |
| KOSMOS2.5-Eval | 7,990 | ✓ | | | |
| FOX | 612 | ✓ | | | |
| DocLocal4K | 4,250 | ✓ | | | |
| Omni AI OCR | 1,000 | | | ✓ | |
| Reducto Rdbench | 1,000 | | | ✓ | |
| Mistral AI | 1,000 | | | ✓ | |
We reviewed several popular document processing benchmarks. Each benchmark addresses specific aspects of document processing:
- OCR (Optical Character Recognition): Converts images or scanned documents into unstructured machine-readable text.
- Key Information Extraction: Identifies and extracts specific data fields (e.g., names, dates, amounts) from documents.
- Markdown Generation: Formats extracted text into structured markdown for easier readability and processing.
However, none of these benchmarks measures automation, i.e., how much of a document workload can be processed reliably without manual intervention.
Benchmarking Automation
Automation can be benchmarked using confidence scores, which indicate how certain a model is about its predictions. By setting a confidence threshold, we can measure the proportion of data a model handles accurately without human intervention, giving an objective way to compare models on their automation capability; a simplified sketch of the calculation is shown below. The full code to replicate this benchmark is publicly available on GitHub.
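To make the idea concrete, here is a minimal sketch of how "automation at a precision target" can be computed from per-field confidence scores and correctness labels. It is an illustration rather than the benchmark's actual implementation (see the GitHub repository for that); the function name and the use of NumPy are our own choices.

```python
import numpy as np

def automation_at_precision(confidences, is_correct, target_precision=0.98):
    """Largest fraction of extracted fields that can be auto-accepted while
    keeping precision at or above `target_precision`.

    confidences : per-field confidence scores from the model
    is_correct  : per-field booleans, True when the prediction matches ground truth
    """
    confidences = np.asarray(confidences, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)

    order = np.argsort(-confidences)            # most confident fields first
    correct_sorted = is_correct[order]

    # Precision of the top-k most confident predictions, for every k.
    cumulative_precision = np.cumsum(correct_sorted) / np.arange(1, len(correct_sorted) + 1)

    # Largest k whose top-k precision still meets the target.
    valid = np.where(cumulative_precision >= target_precision)[0]
    if len(valid) == 0:
        return 0.0                              # no confidence threshold hits the target
    return (valid[-1] + 1) / len(correct_sorted)

# Example: with well-separated confidence scores, 3 of 4 fields can be
# auto-accepted at 98% precision, so the automation rate is 0.75.
print(automation_at_precision([0.99, 0.97, 0.90, 0.60], [True, True, True, False]))
```

The key point is that automation depends on calibration, not just accuracy: a model whose confidence scores cleanly separate correct from incorrect predictions can automate far more work than an equally accurate model with unreliable scores.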
Dataset
We collected 1,000 images from open-source datasets covering common document types such as invoices, receipts, passports, and bank statements. Creating accurate ground truths for structured data is expensive but essential to the integrity of the benchmark. We annotated 16,639 data points and have shared them publicly on Hugging Face.
Methodology
Confidence scores are essential for deciding which predictions can be trusted and which need manual review. Nanonets natively supports confidence scores, allowing precision to be reported directly. Because general-purpose LLMs do not natively provide confidence scores, we estimate them using the methods below (a simplified sketch of the first two follows the list):
- Logits: Confidence derived from raw logits of predictions.
- Consistency: Repeated queries to the LLM assessing response consistency.
- Numeric: Ask the LLM for a numeric confidence estimate.
- Binary: Ask the LLM for a binary confidence estimate (High/Low).
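For illustration, here is a minimal sketch of the first two estimators. It assumes the API exposes per-token log-probabilities and that the caller supplies a `query_llm` function; the exact prompts, sampling parameters, and aggregation used in the benchmark may differ.

```python
import math
from collections import Counter

def logit_confidence(token_logprobs):
    """Logits method: convert the average token log-probability of the
    extracted value back into a probability-like confidence score.
    Assumes per-token log-probabilities are available for the output."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def consistency_confidence(query_llm, prompt, n_samples=5):
    """Consistency method: sample the same extraction several times
    (temperature > 0) and use the agreement rate of the most frequent
    answer as the confidence score. `query_llm` is a caller-supplied
    function that returns a single extracted value as a string."""
    answers = [query_llm(prompt) for _ in range(n_samples)]
    value, count = Counter(answers).most_common(1)[0]
    return value, count / n_samples

# The numeric and binary methods are simpler still: append an instruction such
# as "also return a confidence between 0 and 1" (or "High"/"Low") to the prompt
# and parse the model's self-reported estimate from the response.
```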
Results

Most LLMs fail to achieve any automation at 98% precision. Results improve at a 90% precision target, but 90% precision is not high enough to automate human work. Detailed findings for each method are shared below.

- While general-purpose LLMs perform well on overall accuracy, they struggle to provide reliable confidence scores.
- Gemini 2.0 Flash is the only general-purpose LLM that reached 98% precision, but it could automate only 8% of the data.
- OpenAI's GPT-4o and Claude Sonnet are unable to reach 95% precision.
Implications for Enterprises
Enterprises looking to automate document processing need more than raw accuracy. Without dependable confidence scores, each prediction still demands human review. By emphasizing "automation at 98% precision," this benchmark aims to identify solutions that can genuinely reduce manual work.
Future of this Benchmark
We plan to expand this benchmark by including more document types and exploring additional confidence estimation methods. To learn more or suggest new data categories, please write to us at research@nanonets.com.