Document Processing Automation Benchmark

Introduction

Interest in OCR-based document processing has grown significantly with back-to-back releases from new market entrants. Most recently, Mistral released an OCR model claiming to be cheaper and more accurate than established players, and Andrew Ng released an agentic document extraction product. However, many enterprises struggle to separate valid claims from exaggerated ones. With so many new releases, it can be difficult to identify solutions that truly meet production-level requirements.

Why Benchmarks Matter

Benchmarks provide a structured method to compare and evaluate solutions, helping enterprises filter out unsuitable options, identify tools aligned with their data and operational needs, and streamline validation by reducing the number of products to review. However, a valuable benchmark must align with your organization's real-world challenges. Key considerations include:
  • Dataset Relevance: Does the benchmark dataset reflect the types of documents you handle, such as invoices, receipts, or contracts? Does it account for factors like language, format (scanned vs. digital PDFs), length, and real-world imperfections?
  • Task Completeness: Does the benchmark evaluate all stages of your document extraction process? Does it align with your goals, whether extracting structured data, performing OCR, or enabling enterprise-wide search?

Limitations in Current Benchmarks

We reviewed several popular document processing benchmarks:

Benchmark          # Docs
CC-OCR             7,058
OCRBench           1,000
DocILE Test Set    1,000
BuDDIE             1,665
KOSMOS2.5-Eval     7,990
FOX                  612
DocLocal4K         4,250
Omni AI OCR        1,000
Reducto Rdbench    1,000
Mistral AI         1,000

Each benchmark addresses specific aspects of document processing:
  • OCR (Optical Character Recognition): Converts images or scanned documents into unstructured machine-readable text.
  • Key Information Extraction: Identifies and extracts specific data fields (e.g., names, dates, amounts) from documents.
  • Markdown Generation: Formats extracted text into structured markdown for easier readability and processing.
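
To make these task definitions concrete, the hypothetical sketch below shows what each output type might look like for the same invoice (the values are illustrative and not drawn from any benchmark dataset):

```python
# Hypothetical outputs for one scanned invoice, illustrating the three task types.

# OCR: unstructured, machine-readable text recovered from the image.
ocr_text = "ACME Corp  Invoice INV-1042  Date: 2024-03-01  Total: $1,250.00"

# Key Information Extraction: specific fields pulled into a structured record.
key_information = {
    "vendor": "ACME Corp",
    "invoice_number": "INV-1042",
    "date": "2024-03-01",
    "total_amount": 1250.00,
}

# Markdown Generation: the same content reformatted as structured markdown.
markdown_output = (
    "# Invoice INV-1042\n"
    "\n"
    "| Field  | Value      |\n"
    "|--------|------------|\n"
    "| Vendor | ACME Corp  |\n"
    "| Date   | 2024-03-01 |\n"
    "| Total  | $1,250.00  |\n"
)
```
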
However, none of these benchmarks focus on automation, which involves minimizing manual intervention.

Benchmarking Automation

Automation can be benchmarked using confidence scores, which indicate the model's certainty about its predictions. By setting confidence thresholds, we can measure the proportion of data that a model can accurately handle without human intervention. This approach helps objectively compare the performance of different models in terms of their automation capability. The code to replicate this benchmarking process is available publicly on GitHub.
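
As a rough illustration of this idea (not the benchmark's published scoring code), the sketch below accepts predictions from most to least confident and reports the largest fraction that can be auto-accepted while precision over the accepted set stays at or above the target:

```python
from typing import List, Tuple


def automation_at_precision(
    predictions: List[Tuple[bool, float]],  # (is_correct, confidence) per extracted field
    target_precision: float = 0.98,
) -> float:
    """Largest fraction of predictions that can be auto-accepted (no human review)
    while precision over the accepted set stays >= target_precision.
    Illustrative sketch only; thresholds are searched over observed confidences."""
    if not predictions:
        return 0.0

    # Rank predictions from most to least confident.
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)

    automatable = 0
    correct = 0
    for accepted, (is_correct, _confidence) in enumerate(ranked, start=1):
        correct += int(is_correct)
        if correct / accepted >= target_precision:
            # Accepting the top `accepted` predictions still meets the target.
            automatable = accepted

    return automatable / len(ranked)


# Example: three of four fields are confidently correct, one low-confidence miss.
fields = [(True, 0.99), (True, 0.97), (True, 0.95), (False, 0.40)]
print(automation_at_precision(fields, target_precision=0.98))  # -> 0.75
```

Under this framing, a model that "automates 8% of the data at 98% precision" is one where only its top 8% most confident predictions can be accepted while still meeting the precision target.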

Dataset

We collected 1,000 images from open-source datasets covering common document types such as invoices, receipts, passports, and bank statements. Creating accurate ground truths of structured data is expensive but essential to maintaining the integrity of the benchmark. We annotated 16,639 data points and shared them publicly on Hugging Face.

Methodology

Confidence scores are essential for deciding which predictions can be trusted and which need manual review. Nanonets natively supports confidence scores, allowing direct precision reporting. As general-purpose LLMs do not natively provide confidence scores, we estimate them using the methods below:
  • Logits: Confidence derived from the raw logits of the prediction.
  • Consistency: Query the LLM repeatedly and measure how consistent the responses are (see the sketch after this list).
  • Numeric: Ask the LLM for a numeric confidence estimate.
  • Binary: Ask the LLM for a binary confidence estimate (High/Low).
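
As an example of the consistency method, the sketch below assumes a hypothetical `call_llm` wrapper around whichever LLM API is being benchmarked; it samples the same extraction prompt several times and uses agreement with the majority answer as the confidence score:

```python
from collections import Counter
from typing import Callable, Tuple


def consistency_confidence(
    call_llm: Callable[[str], str],  # hypothetical wrapper around your LLM API
    prompt: str,
    n_samples: int = 5,
) -> Tuple[str, float]:
    """Estimate a confidence score via response consistency: sample the same
    prompt several times and use agreement with the majority answer as the score.
    Illustrative sketch; the benchmark code may normalize answers differently."""
    answers = [call_llm(prompt).strip() for _ in range(n_samples)]
    majority_answer, votes = Counter(answers).most_common(1)[0]
    return majority_answer, votes / n_samples


# Usage sketch: accept the field automatically only above a confidence threshold.
# value, confidence = consistency_confidence(call_llm, "Extract the invoice total: ...")
# if confidence >= 0.9: auto-accept; otherwise route to human review.
```

Note that sampling must be enabled (non-zero temperature) for the responses to vary between runs; otherwise the agreement rate is always 1.0 and the score is uninformative.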

Results

Most LLMs fail to achieve any automation at 98% precision. Results improve at a 90% precision target, but 90% precision is not sufficient to automate human work. Detailed findings for each method are shared below.
  • While general-purpose LLMs perform well on overall accuracy, they struggle to provide reliable confidence scores.
  • Gemini 2.0 Flash is the only general-purpose LLM that reached 98% precision, but it could only automate 8% of the data.
  • OpenAI’s GPT-4o and Claude Sonnet are unable to reach 95% precision.

Implications for Enterprises

Enterprises looking to automate document processing need more than raw accuracy. Without dependable confidence scores, each prediction still demands human review. By emphasizing "automation at 98% precision," this benchmark aims to identify solutions that can genuinely reduce manual work.

Future of this Benchmark

We plan to expand this benchmark by including more document types and exploring additional confidence estimation methods. To learn more or suggest new data categories, please write to us at research@nanonets.com.